This study investigates perceptual similarity at two levels: music tracks (track-level) and the individual instrumental parts that compose them (part-level). A previous work studied perceptual part-level similarity toward developing a model that estimates it, conducting an ABX-style listening test with 632 participants that evaluated similarity at both levels from the perspectives of timbre, rhythm, melody and overall. Although that work yielded some findings from the evaluations, further insights are needed to support the development of future estimation models. Specifically, important questions remain regarding the correspondence between track- and part-level similarity, the generalizability of findings across multiple models, and the validity of the conventional learning method in terms of perceptual similarity. This study revealed the following key findings: (1) the instrumental parts that predominantly affect track-level similarity differ across music triplets and listeners, with the influence of differences across music triplets exceeding that of differences across listeners, indicating that part-level similarity helps in estimating track-level similarity; (2) when temporal average pooling is applied, the outputs of deep learning models correspond more closely with perceptual evaluations based on timbre than with those based on rhythm, indicating a potential area for improvement in the models; (3) the similarity between temporally distinct segments within the same music track is perceived to be significantly higher than that between segments from different tracks, which supports the assumption of the conventional unsupervised learning method developed for music similarity estimation.
1. Introduction
Measuring music similarity is an important research topic owing to its critical role in music retrieval and recommendation systems. Such systems require similarity measures that reflect listeners’ perceptual similarity (Berenzweig et al., 2003; Ellis et al., 2002). However, manually labeling similarity across huge music collections is impractical. Therefore, many techniques for automatically estimating music similarity have been presented, including approaches based on Gaussian mixture models (Aucouturier and Pachet, 2002; Pampalk, 2006), string-based representations (Casey and Slaney, 2006), sampling with collaborative filtering (McFee et al., 2012), binary code representations (Schlüter, 2013), transfer learning (Hamel et al., 2013), metric learning (Wolff, 2014), path-based methods (Gabbolini and Bridge, 2021) and perception-related acoustic features identified from user surveys (Cheng et al., 2020), as well as analysis of the reliability of similarity evaluation (Urbano et al., 2011).
We proposed deep metric learning methods for estimating music similarity focusing on each instrumental part (part-level similarity) (Hashizume et al., 2025, 2022), in contrast to previous studies that targeted the similarity of entire music tracks (track-level similarity). Our previous works aimed to achieve flexible music retrieval and recommendation that allow users to choose which aspect of the music to focus on. We used deep learning, which has become a standard approach for automatic music feature estimation and has shown strong performance in recent years. A variety of data-driven methods for track-level music feature estimation have been developed using available labels, including genre (Choi et al., 2019; Elbir and Aydin, 2020; Fathollahi and Razzazi, 2021; Hamel and Eck, 2010; Li et al., 2010), artist information (Cleveland et al., 2020; Lee et al., 2019; Park et al., 2018) and human annotation (Lu et al., 2017). Moreover, recent studies have shown that deep metric learning is effective for learning meaningful music representations (Lee et al., 2020, 2019; Lu et al., 2017; Park et al., 2018; Prétet et al., 2020). However, unlike these studies, our problem setting lacks labels suitable for training, that is, part-level similarity labels. To address this issue, we used an unsupervised learning method with a loss function that reduces the distance between temporally distinct segments of an instrumental part within the same music track; this approach benefits from requiring no labeled data. One limitation of this method is that the correspondence of the learned similarity measure to perceptual similarity is unclear, as perceptual music similarity is not used as an explicit target during training.
To investigate the correspondence between the estimated music similarity and perceptual music similarity, subjective evaluations obtained from a listening test are required. For many years, work has been done on perceptual music similarity through listening tests. Studies that clarify the factors listeners prioritize when perceiving music (Eerola et al., 2001; Mcadams and Matzkin, 2001; Novello et al., 2011) contribute to the design of methods that automatically estimate similarity based on listeners’ perception. Addressing how to aggregate responses from multiple participants (Eerola et al., 2001; Mcadams and Matzkin, 2001; Novello et al., 2011) is essential for constructing reliable ground truth labels for model evaluation. However, no listening tests have been conducted in which participants evaluate the part-level music similarity.
To address this limitation, we conducted an ABX-style listening test for the part-level similarity with 632 participants and collected 26,898 responses in our previous work (Hashizume and Toda, 2025). The perceptual similarity was evaluated from four perspectives: timbre, rhythm, melody and overall, with overall denoting a comprehensive assessment integrating the other three. We analyzed the results of the listening test and found the following:
The relative perceptual music similarity among three tracks (i.e. the similarity comparison between an X-A pair and an X-B pair) varies depending on the instrumental part that listeners focus on.
Rhythm and melody tend to have a larger impact on perceptual music similarity for each instrumental part than timbre, except for the melody of drums.
Our previously developed model (Hashizume et al., 2025) mainly captures timbre-related similarity. Furthermore, the responses collected in the ABX test were released as the dataset [1].
Although our previous work provided several findings, further insights are needed to support the development of future estimation models. While we showed that the relative similarity between music tracks depends on the instrumental part in focus, how overall track-level similarity is determined remains unexplored. In addition, while our previous work evaluated our own model, it did not include evaluations across multiple models, leaving more general conclusions open. Furthermore, although we collected evaluations under the same sample selection scheme as our unsupervised learning framework, those evaluations had not yet been analyzed.
In this study, we aim to investigate part-level similarity from the following aspects through our analysis:
which instrumental parts predominantly affect track-level perceptual music similarity;
the correspondence between the output of the deep learning models and perceptual similarity in terms of timbre, rhythm and melody; and
the validity of the assumption underlying conventional unsupervised learning methods for music similarity estimation, namely, that the similarity between temporally distinct segments within the same track is higher than that between segments from different tracks.
2. Related works
One of the objectives of our study is to clarify the perception of part-level similarity to support the development of part-level similarity estimation models. Several studies have previously analyzed the perception of music similarity and contributed to the estimation of track-level perceptual similarity. They have clarified which factors influence similarity perception, for example, the impact of rhythm compared to pitch (Mcadams and Matzkin, 2001), the role of frequency-based descriptors (Eerola et al., 2001) and the effects of genre, tempo and timbre (Novello et al., 2011).
Another objective of our study is to create an evaluation data set for part-level similarity estimation models based on subjective evaluations obtained from multiple participants. Similar efforts have been made in several studies for track-level similarity. Several studies have constructed ground truth data sets for music similarity, using both nonexpert subjective evaluations and expert assessments. For example, Ellis et al. (2002) and Berenzweig et al. (2003) collected large-scale evaluations through a listening test to estimate artist similarities. Typke et al. (2005) and Novello et al. (2011) built ground truth data sets for popular music similarity, revealing high consistency in perceptual evaluations across listeners. Müllensiefen and Frieler (2007) and Volk and van Kranenburg (2012) further provided annotated data sets focusing on melodic similarity.
The above studies focused on track-level perceptual similarity. Research closer to our part-level similarity work has conducted listening tests using individual instrumental sounds as stimuli (Lakatos, 2000; McAdams et al., 1995). These studies mainly conducted listening tests focusing on musical timbre and showed that perceptual similarity evaluations are consistent among listeners and relate to acoustic features of the instruments.
The studies mentioned above have all provided important insights into music perception and are closely related to our work. However, these studies do not cover the analysis or data set construction of part-level perceptual similarity together with the track-level similarity, using audio of individual instrumental parts and that of the original music tracks composed of them.
3. Listening test for perceptual music similarity evaluation
In this section, we describe the listening test conducted in our previous work (Hashizume and Toda, 2025). The listening test was conducted on the web page shown in Figure 1 with participants recruited through crowdsourcing. To investigate part-level perceptual music similarities, we conducted an ABX-style listening test on the similarity of sample sets consisting of instrumental parts and original music tracks. Participants provide responses for each sample set, which refers to a set of three audio samples labeled A, B and X. A collection of sample sets that participants respond to in a single listening test session is called an evaluation set. For each sample set, participants provide responses from four perspectives: timbre, rhythm, melody and overall. These perspectives were determined based on previous findings indicating that music similarity is influenced by genre, tempo and timbre (Novello et al., 2011), by rhythm and pitch (Mcadams and Matzkin, 2001) and by melodic (Eerola et al., 2001; Volk and van Kranenburg, 2012) and timbral (Lakatos, 2000; McAdams et al., 1995) characteristics. Each sample set consisted of either three different samples from the same instrument category (e.g. all three are drum sounds) or three different music tracks. One evaluation set included instrumental parts from five different categories (drums, bass, piano, guitar, others) and original music tracks.
Figure 1. Image of the web page used in the listening test. Each sample set has play buttons for three samples and response buttons for four perspectives: timbre, rhythm, melody and the total similarity (i.e. the overall similarity across these three perspectives; hereinafter referred to as “overall” in the paper). In this figure, “Reference” corresponds to X as used in this paper. The actual screen shows the instructions for the procedure of the listening test above this. In addition to the content described in Section 3.1, the instructions state that participants are required to listen to each sample from beginning to end. Furthermore, the “Next” button remains disabled until all samples have been played and responses satisfying the constraints described in Section 3.1 have been selected.
3.1 Evaluation procedure for one sample set
As shown in Figure 1, participants were presented with three audio samples, X, A and B, and listened to all of them. They chose A+ if they perceived A to be strongly more similar to X than B, and A− if only slightly more similar; likewise, they chose B+ or B− when they perceived B to be more similar to X than A. Responses were given from four perspectives: timbre, rhythm, melody and overall.
In each response, participants were allowed to select N/A for up to two perspectives, excluding overall. The instructions described the cases in which N/A could be selected: “A and B are similar/dissimilar to X to an equal degree” or “The presented instrumental track has no element corresponding to the perspective, e.g. drums have no melody.”
3.2 Sample selection
An overview of sample selection is shown in Figure 2. We created two types of sample set: “χαβ” and “χχ′γ”. For χαβ, we randomly selected three different music tracks, χ_i, α_i and β_i, from the test set of the Slakh2100 data set. This combination is referred to as a music triplet. We randomly captured five-second segments from each instrumental part contained in the three tracks of that music triplet. Five-second segments were also randomly extracted from each original music track. Then, we obtained one sample set for each instrument or music track, {X, A, B} = {χ_ij, α_ij, β_ij}. The subscript j represents each category of instrument or the music track (j = 0, …, 5), where 0 represents drums, 1 bass, 2 piano, 3 guitar, 4 others, and 5 the original music track. This selection was repeated four times (i = 0, …, 3), and 24 sample sets were created (6 categories × 4 repetitions).
Figure 2. How to create one evaluation set. Examples of how one sample set is created for the piano part (j = 2) are shown with red lines for χαβ and χχ′γ. Sample sets are created in the same manner for the other instrumental parts and the original music track, and the procedure is repeated four times (i = 0, …, 3).
For χχ′γ, we used the same X (= χ_ij) as in χαβ, and one of the other two samples was taken from a temporally distinct segment of the same music track as X. Namely, we randomly selected a music track γ_i other than χ_i and replaced α_ij and β_ij with χ′_ij and γ_ij, respectively, and the process was repeated in the same manner as for χαβ. Another 24 sample sets were thus created.
Then, the 48 sample sets constituted one evaluation set. This procedure was repeated with random sample selection, resulting in 60 evaluation sets. Hereafter, time indices are omitted for simplicity. To distinguish the segment taken from the same track as χ after dropping the indices, we denote it as χ′. The following sketch summarizes this procedure.
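The construction above can be summarized in a minimal sketch, assuming each track is available as a dict mapping part names to waveform arrays; names such as `segment` and `make_evaluation_set` are hypothetical, and the check that χ′ does not temporally overlap χ is omitted for brevity.

```python
import random

import numpy as np

SR = 44100  # assumed sample rate of the stems
CATEGORIES = ["drums", "bass", "piano", "guitar", "others", "music"]  # j = 0..5

def segment(track: dict, category: str) -> np.ndarray:
    """Cut a random five-second segment from one part (or the full mix) of a track."""
    audio = track[category]
    t = random.randint(0, len(audio) - 5 * SR)
    return audio[t : t + 5 * SR]

def make_evaluation_set(tracks: list) -> list:
    """Build one evaluation set: 24 chi-alpha-beta plus 24 chi-chi'-gamma sample sets."""
    sample_sets = []
    for i in range(4):  # four repetitions (i = 0..3)
        chi_t, alpha_t, beta_t = random.sample(tracks, 3)  # music triplet
        gamma_t = random.choice([t for t in tracks if t is not chi_t])
        for cat in CATEGORIES:
            chi = segment(chi_t, cat)
            # chi-alpha-beta: X, A and B come from three different tracks.
            sample_sets.append({"X": chi, "A": segment(alpha_t, cat), "B": segment(beta_t, cat)})
            # chi-chi'-gamma: chi' is another segment of the same track as X;
            # the {chi', gamma} ordering is randomized before assignment to {A, B}.
            pair = [segment(chi_t, cat), segment(gamma_t, cat)]
            random.shuffle(pair)
            sample_sets.append({"X": chi, "A": pair[0], "B": pair[1]})
    return sample_sets
```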
3.3 Setup of the listening test
The presented audio samples were 136 music tracks and the instrumental parts that compose them. They were the tracks remaining from the 151 tracks in the test set of the redux subset of the Slakh2100 data set (Manilow et al., 2019) after excluding tracks that do not contain enough instrumental parts. Fifty sample sets, consisting of the 48 sample sets created as explained in Section 3.2 and two dummy sample sets, were shuffled and presented to each participant individually. The participants were not informed whether the presented sample set was χαβ, χχ′γ or a dummy. The dummy sample sets were of two types: one in which A is exactly the same audio as X, and one in which B is exactly the same audio as X. In χχ′γ, one of the two orderings, {χ′, γ} or {γ, χ′}, was randomly assigned to {A, B}. Since this study aims at a system for general users rather than experts, the listening tests were conducted by recruiting participants through a crowdsourcing platform (CrowdWorks, 2025). Participation was fully voluntary and anonymous, and no personally identifiable information was collected. Participants were free to withdraw from the experiment at any time. The average duration of a single session, in which one evaluation set was assessed, was approximately 30 min, and participants were compensated according to the standard rates of the platform.
3.4 Response aggregation
The following responses were excluded as inappropriate for analysis: all responses from participants who did not select the same sample as X in the dummy tests; responses that took less than 15 s including listening time; responses left blank owing to technical problems; and duplicate responses from the same participant within each evaluation set. After these exclusions, counting the set of evaluations on timbre, rhythm, melody and overall as a single response, we obtained 26,898 valid responses from a total of 632 participants (281 unique participants). For all 60 evaluation sets, valid responses were obtained from at least six different participants. Note that this count also includes responses given only for instrumental parts. This is because the listening test was conducted in two phases: in the first phase, only evaluations of individual instrumental parts were collected to obtain instrument labels, namely, responses to 42 sample sets (5 categories × 4 repetitions × 2 types, plus 2 dummy sets); in the second phase, participants responded to complete evaluation sets, which consisted of 50 sample sets including the original music tracks. The total number of valid participants who responded to the complete evaluation sets was 346 (167 unique participants); the total number of valid responses to the complete evaluation sets was 16,317. Figure 3 shows a histogram of the number of participants providing valid responses and the number of valid responses to each evaluation set.
Figure 3. Number of participants providing valid responses and number of valid responses to each evaluation set. (a) Number of all participants providing valid responses, (b) number of all valid responses, (c) number of participants who provided valid responses to the complete evaluation sets, (d) number of valid responses to the complete evaluation sets.
In addition, the number of strong responses (A+ or B+), the number of weak responses (A− or B−), and the number of N/A responses are shown in Figure 4. In the case of drums, there were many N/A responses for melody. This is assumed to be because many participants evaluated that “drums have no melody,” which was given as an example in the instructions.
Figure 4. Heatmaps showing the number of strong responses (A+ or B+), the number of weak responses (A− or B−) and the number of N/A responses for each instrument and each perspective, for χαβ and χχ′γ. Note that only the responses to the complete evaluation sets are shown here to enable comparison between the original music tracks and each instrumental part.
4. Analysis
In this section, we examine three questions through analysis:
Which instrumental parts are dominant in determining the track-level music similarity?
Is there a correspondence between the similarity criteria output by the deep learning models and perceptual music similarity in terms of timbre, rhythm and melody?
Is the similarity between temporally distinct segments within the same music track perceived as higher than that between segments from different tracks?
In this paper, unless otherwise noted, we define “A+” and “A−” as the same response, and “B+” and “B−” as the same response; in other words, the response is treated as A, B or N/A. (For overall, the response is either A or B).
4.1 Impact of each instrumental part on track-level similarity
In this section, we examine which instrumental parts predominantly contribute to the determination of track-level music similarity, as illustrated in Figure 5 (here, we focus only on χαβ). As shown in our previous work (Hashizume and Toda, 2025), the perceived similarity between music tracks varies depending on the instrumental part being focused on (Figure 6). This result raises the question of which instrumental parts listeners prioritize when determining track-level similarity. For example, as in Figure 5, if a participant perceives that music track A is more similar to X for the bass, piano and others parts, but music track B is more similar to X for the drums and guitar, and evaluates that music track B is more similar to X for the original music track (i.e. mixed sounds), then we can consider that the drums and guitar are dominant in that music triplet for that participant.
Figure 5. Highlighting of the pair that a participant perceives as more similar when listening to the instrumental parts that compose the same music triplet. In this example, the drums and guitar are considered dominant because their evaluations are consistent with the track-level similarity.
Figure 6. Our previous work (Hashizume and Toda, 2025) showed that the relative perceptual music similarity among three tracks (i.e. the similarity comparison between an X-A pair and an X-B pair) varies depending on the instrumental part being listened to. In the example shown in Figure 5, the drums and guitar are matching, whereas the drums and piano are mismatching. This figure summarizes all responses from all participants across all music triplets and shows the matching rate for each pair of instruments (in χαβ).
We calculated the matching rates between the track-level evaluation and the evaluation of each instrumental part, and Figure 7 shows the results for each music triplet and for each participant, along with dendrograms obtained using Ward’s hierarchical clustering (Ward, 1963). In Figure 7(a), light colors indicate that participants were divided in their evaluation of the dominant part. Dark colors represent a high level of agreement among participants: red indicates that the participants perceived the part as dominant, whereas blue indicates that they perceived it as not dominant. In Figure 7(b), light colors indicate that the participant’s evaluation changes depending on the music triplet. Dark colors indicate a high level of consistency in a participant’s evaluations across different music triplets: red indicates that the participant consistently perceived the part as dominant, whereas blue indicates that the part was consistently perceived as not dominant. It can be seen that the dominant instrumental parts vary across both music triplets and participants. Clusters corresponding to patterns associated with particular combinations of instruments were formed. For example, in (a), when the dendrogram is cut to yield four clusters, the bottom cluster represents a group of music triplets in which the others part is consistently disregarded across participants. Furthermore, this cluster is divided at the next branch into two subclusters: one in which the bass part is attended to, and another in which it is disregarded. Figure 8 presents a two-dimensional plot obtained by applying t-SNE for dimensionality reduction, with colors indicating the clusters identified using Ward’s method; a sketch of this clustering procedure is shown below. These results suggest that the instrumental parts that listeners focus on can be grouped according to either the music triplet or the listener.
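As a sketch, the clustering and embedding shown in Figures 7 and 8 can be reproduced from the matching-rate vectors with SciPy and scikit-learn; the input file name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from sklearn.manifold import TSNE

# Each row is a five-dimensional vector of matching rates between the
# track-level evaluation and each part-level evaluation (drums, bass,
# piano, guitar, others), per music triplet or per participant.
rates = np.load("matching_rates.npy")  # hypothetical precomputed file

link = linkage(rates, method="ward")                   # Ward's hierarchical clustering
row_order = leaves_list(link)                          # row ordering for the heatmap
clusters = fcluster(link, t=32, criterion="maxclust")  # the 32 clusters of Figure 8

# Two-dimensional embedding of the same vectors (Figure 8).
xy = TSNE(n_components=2, random_state=0).fit_transform(rates)
```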
Figure 7. The heatmaps show the matching rates between track-level evaluations and the evaluations for each instrumental part, represented as a five-dimensional vector. (a) The matching rates calculated across all participants who responded to one music triplet, and (b) the matching rates calculated across all music triplets responded to by one participant. The vectors are reordered based on hierarchical clustering using Ward’s method (Ward, 1963), with dendrograms displayed on the left and top sides. Horizontally, clustering is performed by instrumental parts. Vertically, clustering is performed by music triplets in (a) and by participants in (b).
Figure 8. Two-dimensional plots of the five-dimensional vectors shown in Figure 7, reduced using t-SNE. The colors indicate the 32 clusters obtained by cutting the dendrogram from Ward’s hierarchical clustering, as presented in Figure 7.
In addition, comparing the two heatmaps in Figure 7, (a) based on music triplets and (b) based on participants, we can see that (a) contains more dark-colored cells. This suggests that evaluations of the same music triplet show high agreement across participants in terms of whether an instrumental part is perceived as dominant. To quantify this, we tested whether there is a significant difference between the variability of evaluations across different participants for the same music triplet and the variability of evaluations across different music triplets from the same participant. Specifically, for each instrumental part, we first constructed data in which evaluations were coded as 1 if the track- and part-level evaluations matched, and 0 if they did not. Then, we examined whether the mean of the variances of the data within each music triplet was smaller than the mean of the variances of the data within each participant, using the Mann–Whitney U test (Mann and Whitney, 1947); a sketch of this test is given after this paragraph. The results are summarized in Table 1. For all instruments, the within-triplet variance was lower than the within-participant variance on average. Although the effect sizes were small, the difference was statistically significant (p < 0.05) for all instruments except the piano. This indicates a consistent tendency across the data rather than a large separation between the two distributions. In other words, the tendency for different listeners to evaluate a given music triplet similarly is stronger than the tendency for the same listener to consistently focus on a particular instrument. This result implies that which instrumental parts are perceived as dominant is more likely to depend on the music triplet than on the individual. Taken together, these results show that certain instrumental parts have a strong influence on music similarity and that, despite individual differences, there is a considerable level of agreement across listeners. Therefore, estimating which instruments are dominant and leveraging this information in track-level similarity estimation is expected to improve the accuracy of general track-level similarity measures.
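A sketch of this test for one instrument, assuming the per-triplet and per-participant variances of the 0/1 match indicators are precomputed, and assuming the effect size is the standard r = |Z|/√N obtained from the normal approximation of U:

```python
from scipy.stats import mannwhitneyu

def compare_variances(triplet_vars, participant_vars):
    """One-sided test: are within-triplet variances smaller than within-participant ones?"""
    u, p = mannwhitneyu(triplet_vars, participant_vars, alternative="less")
    n, m = len(triplet_vars), len(participant_vars)
    # Normal approximation of U under the null hypothesis, then r = |Z| / sqrt(N).
    z = (u - n * m / 2) / (n * m * (n + m + 1) / 12) ** 0.5
    return p, abs(z) / (n + m) ** 0.5
```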
Table 1. The mean of the variances of the data within each participant (averaged across participants) and the mean of the variances of the data within each music triplet (averaged across music triplets), where the data represent binary values (1 or 0) indicating whether the evaluations matched the track-level evaluations
| Statistic | Drums | Bass | Piano | Guitar | Others |
|---|---|---|---|---|---|
| Mean variance in a triplet | 0.21496 | 0.22861 | 0.22288 | 0.22473 | 0.22137 |
| Mean variance in a participant | 0.25918 | 0.24173 | 0.23684 | 0.25223 | 0.24672 |
| p-value | 0.00000 | 0.00906 | 0.09272 | 0.00001 | 0.00170 |
| Effect size | 0.252 | 0.125 | 0.070 | 0.225 | 0.154 |
The p-values for testing whether the within-triplet variance is smaller than the within-participant variance based on the Mann–Whitney U test, as well as the absolute effect sizes (|r|), are also shown. Since the number of evaluations per music triplet and per participant differed in range, we filtered the data to include only music triplets and participants with between 4 and 9 evaluations, where the ranges overlapped. After this filtering, the average number of evaluations per music triplet was 5.74, and the average number of evaluations per participant was 4.68. The means were calculated after this filtering.
4.2 Performance evaluation of the deep learning models
In our previous work (Hashizume and Toda, 2025), we assessed the performance of our model (Hashizume et al., 2025) using evaluations obtained in the listening test. We showed that our model (Hashizume et al., 2025) mainly captures timbral features. Our work also showed that rhythm and melody are prioritized over timbre in perception (Table 2), highlighting a gap that points to potential areas for improving our method. However, evaluating the performance of a single model provided only limited insight. Therefore, in this study, we conducted the same evaluation on the large-scale self-supervised learning model, MERT (Li et al., 2024), to gain further insight.
Table 2. The matching rate with overall for each perspective [%] [shown in our previous work (Hashizume and Toda, 2025)]
| Perspectives | Drums | Bass | Piano | Guitar | Others | Music |
|---|---|---|---|---|---|---|
| Timbre | 74.7 | 74.5 | 75.6 | 75.1 | 75.3 | 74.3 |
| Rhythm | 84.4 | 80.1 | 78.4 | 80.6 | 79.8 | 82.0 |
| Melody | 56.7 | 79.0 | 80.7 | 81.0 | 83.1 | 82.8 |
For a given sample set, the perspectives for which a participant’s evaluations matched the overall evaluation were considered to be given priority, and thus, perspectives with higher matching rates were interpreted as more important in perception. Note that N/A is included in counting and treated as a mismatch with both A and B
4.2.1 Models to be evaluated.
4.2.1.1 Our previously developed model.
We developed an instrumental-part-level similarity learning method with a single network that takes original music tracks (i.e. mixed sounds) as input (Hashizume et al., 2025). We designed a similarity embedding space with separated subspaces for each instrumental part using conditional similarity networks (Veit et al., 2017). To separate the embedding space, a mask is applied to all dimensions except those corresponding to the notion considered in the triplet loss calculation. For example, when learning the drum space, a binary mask that retains only the dimensions assigned to drums is applied, and the distance between the masked feature representations is used for metric learning with the triplet loss; a sketch is given below. When learning the drum space, it is necessary to sample data based on (dis)similarity focusing on the drum part. To address this, we introduced pseudo musical pieces, which extend unsupervised learning under the assumption that segments within the same track are similar (see Appendix 1 for details). The model input was a mel-spectrogram converted from the original music track signal, and the output was a 640-dimensional vector in which five instruments (drums, bass, piano, guitar and others) were each assigned a 128-dimensional subspace. The model was trained using the training data from the Slakh2100 data set (Manilow et al., 2019). The network consisted of ten convolutional layers, followed by temporal average pooling and a fully connected layer.
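A minimal sketch of the masked triplet loss described above, under the 640 = 5 × 128 dimensional layout; the Euclidean distance, the margin value and the function name are illustrative assumptions rather than the exact training configuration:

```python
import torch
import torch.nn.functional as F

N_PARTS, SUBDIM = 5, 128  # drums, bass, piano, guitar, others

def masked_triplet_loss(anchor, positive, negative, part, margin=0.2):
    """Triplet loss computed in the subspace assigned to one instrumental part.

    anchor, positive, negative: (batch, 640) embeddings from the network.
    part: index of the instrument whose subspace is being trained.
    """
    mask = torch.zeros(N_PARTS * SUBDIM, device=anchor.device)
    mask[part * SUBDIM : (part + 1) * SUBDIM] = 1.0  # keep only this part's dims
    d_pos = torch.norm((anchor - positive) * mask, dim=1)
    d_neg = torch.norm((anchor - negative) * mask, dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```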
4.2.1.2 MERT with average poolings.
We also conducted a performance evaluation using MERT (Li et al., 2024), a pretrained model based on self-supervised learning that can extract musical features. Unlike the model developed in our previous work (Hashizume et al., 2025), MERT does not have a mechanism for extracting the distinct features of each instrumental part from original music tracks, so we input the instrumental part signals instead. We downsampled the signals to 24 kHz and extracted 3D tensors with 25 layers × T time steps × 1024 feature dimensions from the intermediate layers of the pretrained model [2]. We then applied average pooling along the layer and time axes to obtain 1024-dimensional representations. Note that in this type of framework, where individual instrumental signals are used as inputs, it is generally necessary to use estimated signals obtained from source separation of the original music track, as in the previous work (Hashizume et al., 2022), because true individual instrumental signals are typically not available at inference. However, in this experiment, to eliminate the influence of separation performance and simplify the discussion, we used the ideal upper bound of separated signals, that is, the ground-truth stems, as inputs (see the sketch below).
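A sketch of this extraction, assuming the publicly released Hugging Face checkpoint m-a-p/MERT-v1-330M (whose hidden states form a 25-layer × T × 1024 tensor consistent with the description above); the checkpoint name is our assumption, and note [2] gives the model actually used:

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

CKPT = "m-a-p/MERT-v1-330M"  # assumed checkpoint
model = AutoModel.from_pretrained(CKPT, trust_remote_code=True).eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT, trust_remote_code=True)

def embed_stem(path: str) -> torch.Tensor:
    """Average-pooled 1024-dim representation of one instrumental stem."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)  # mixdown to mono
    wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)  # 24 kHz
    inputs = processor(wav, sampling_rate=processor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = torch.stack(out.hidden_states).squeeze(1)  # (25 layers, T frames, 1024 dims)
    return h.mean(dim=(0, 1))  # average over the layer and time axes
```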
4.2.2 Performance evaluation procedure.
We calculated whether A or B was closer to X using the models for the same sample sets as used in the listening test and then calculated the matching rate between the models’ results and the subjective evaluation results, as shown in Figure 9. The distances were calculated using the models as follows: the music tracks originally containing the instrumental parts used in the listening test were input into the model, the instrumental-part-level feature representations were extracted, and the distances between them were measured. Sample sets with low agreement among participants indicate that A and B are about equally (dis)similar to X, making them unsuitable for this performance evaluation method. Therefore, only evaluations with agreement equal to or higher than 80% among participants were used. This filtering focuses on reliable perceptual judgments by emphasizing high-consensus cases, at the cost of broader sample coverage. To avoid excessively reducing the number of evaluation samples, we set the consensus threshold to 80%, balancing reliability and coverage. Note that if N/A accounted for the largest percentage, that sample set was excluded from the evaluation set. A sketch of this procedure follows.
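The filtering and matching can be sketched as follows; the helper names are hypothetical, and the 80% threshold follows the text:

```python
from typing import Optional

import numpy as np

def consensus_label(responses: list) -> Optional[str]:
    """Return 'A' or 'B' if at least 80% of participants agree, else None.

    responses: one sample set's answers ('A', 'B' or 'N/A') for one perspective.
    Sample sets whose most frequent answer is N/A are excluded entirely.
    """
    votes, counts = np.unique(responses, return_counts=True)
    top = votes[counts.argmax()]
    if top == "N/A" or counts.max() / len(responses) < 0.8:
        return None
    return top

def model_matches(dist_xa: float, dist_xb: float, label: str) -> bool:
    """True if the model's nearer sample agrees with the participants' choice."""
    return ("A" if dist_xa < dist_xb else "B") == label
```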
Figure 9. How to evaluate the models.
The numbers of unique sample sets and of evaluations used for this performance evaluation are shown in Table 3. We can see that there is more agreement among participants in their evaluations in χχ′γ than in χαβ. Moreover, there are fewer sample sets for timbre, rhythm and melody than for overall because participants could not select N/A for overall but could for the others, and sample sets that received N/A from 80% or more of the participants were omitted from the evaluation.
Table 3. Number of unique sample sets and number of evaluations used for performance evaluation
| Perspectives | Drums | Bass | Piano | Guitar | Others |
|---|---|---|---|---|---|
| (a) Number of unique sample sets of χαβ | | | | | |
| Overall | 92 | 95 | 83 | 90 | 95 |
| Timbre | 49 | 44 | 37 | 42 | 43 |
| Rhythm | 56 | 47 | 26 | 33 | 38 |
| Melody | 4 | 41 | 21 | 25 | 47 |
| (b) Number of unique sample sets of χχ′γ | | | | | |
| Overall | 212 | 180 | 210 | 209 | 192 |
| Timbre | 212 | 192 | 214 | 206 | 197 |
| Rhythm | 201 | 164 | 192 | 192 | 185 |
| Melody | 211 | 177 | 203 | 200 | 188 |
| (c) Number of evaluations of χαβ | | | | | |
| Overall | 912 | 949 | 836 | 925 | 977 |
| Timbre | 463 | 420 | 330 | 412 | 414 |
| Rhythm | 517 | 433 | 233 | 317 | 355 |
| Melody | 32 | 363 | 186 | 238 | 437 |
| (d) Number of evaluations of χχ′γ | | | | | |
| Overall | 2143 | 1792 | 2082 | 2096 | 1926 |
| Timbre | 1926 | 1745 | 2044 | 1993 | 1892 |
| Rhythm | 1983 | 1555 | 1800 | 1819 | 1721 |
| Melody | 1391 | 1589 | 1905 | 1898 | 1786 |
Only sample sets with an agreement rate of 80% or higher among participants and evaluations for them were included
4.2.3 Results.
The performance evaluation results of our model are shown in Figure 10. Although accuracy is lower on χαβ than on χχ′γ, for drums, piano and guitar the accuracy obtained using evaluations focusing on timbre is higher than that obtained using evaluations focusing on overall. For guitar, the accuracy using melody-focused evaluations is also higher than that using overall. Note that although the drum melody also shows a relatively high value, this result is based on a significantly smaller number of samples than the other results. Furthermore, we consider that this evaluation reflects the perceived similarity of drum patterns (see Appendix 2). For all instruments, rhythm tends to show an equal or lower accuracy than the other perspectives. This suggests that the model captures frequency-related features but not rhythmic features. We attribute this to the network architecture, which contains temporal average pooling. The performance evaluation results of MERT with average poolings are shown in Figure 11. Note that, as mentioned in Section 4.2.1, since this performance was obtained by inputting ideal instrumental signals (stems), a direct comparison between this result and that of our previous work is not possible. On the other hand, similar trends are observed between the model in our previous work and MERT with temporal and channel-wise average pooling, namely that performance on rhythm is lower than on the other perspectives. These findings suggest that temporal average pooling in deep learning models leads to a loss of rhythm information in any instrumental part.
Figure 10. Graphs of accuracy and the 95% confidence intervals of the previously developed model (Hashizume et al., 2025) for each instrument and each perspective, for χαβ and χχ′γ. The 95% confidence intervals were calculated using the Clopper–Pearson method (Clopper and Pearson, 1934). Colors represent instruments, and symbols represent perspectives. This result was shown in our previous work (Hashizume and Toda, 2025).
Figure 11. Graphs of accuracy and the 95% confidence intervals (Clopper and Pearson, 1934) of MERT for each instrument and each perspective, for χαβ and χχ′γ. Colors represent instruments, and symbols represent perspectives.
In conjunction with the finding in our previous work that listeners placed more importance on rhythm or melody than on timbre (Table 2), we consider that if we can design a model that captures temporal structure so that rhythm is taken into account, it will be possible to obtain a music similarity measure that is also compatible with listeners’ overall perception. This could be explored in future work by incorporating attention-based temporal modeling or by learning representations through downstream rhythm-related tasks (e.g. beat tracking), among other approaches.
4.3 Similarity within and between music tracks
In this section, we examine whether the similarity between temporally distinct segments within the same music track is perceived to be higher than that between different music tracks. In our previous works (Hashizume et al., 2025, 2022), we used an unsupervised learning method with triplet loss under the assumption that the similarity between temporally distinct segments within the same music track is higher than that between different music tracks. As shown in Figure 12, participants compare two between-music similarities in χαβ, whereas they compare the within-music similarity with the between-music similarity in χχ′γ. The data selection setting in χχ′γ is the same as in our unsupervised learning method. Table 4 shows the rate at which the segment from the same music track as X, namely χ′, was selected in χχ′γ. The 95% confidence intervals were calculated using the Clopper–Pearson method (Clopper and Pearson, 1934); a sketch of this computation follows. Considering all evaluations including N/A, the percentage exceeds 70% for all cases except the melody of drums. When N/A evaluations are excluded, the percentage exceeds 80% for all cases, including the melody of drums. These results show that the similarity between randomly extracted segments within the same music track is perceived to be significantly higher than that between different music tracks. This suggests that our previous method (Hashizume et al., 2025, 2022) is appropriate for realizing the goal of learning similarity that aligns with human perception. Furthermore, we can see from Figure 4 that the number of participants who perceived a strong similarity is higher for χχ′γ than for χαβ. Together, these results indicate that within-music similarity is perceived to be significantly higher than between-music similarity.
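A sketch of the interval computation, using the exact beta-quantile form of the Clopper–Pearson bound; the counts in the usage line are illustrative, not values from Table 4:

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return 100 * lo, 100 * hi  # in percent, as in Table 4

print(clopper_pearson(1700, 1850))  # illustrative counts; prints the interval in percent
```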
Figure 12. Explanation of between-music similarity and within-music similarity. In χχ′γ, one sample is a different segment of the same music track as X. Thus, one of the similarities to evaluate is the within-music similarity.
Table 4. Percentage of evaluations selecting the segment from the same music track (χ′) in χχ′γ, with the 95% confidence intervals [%]
| Perspectives | Drums | Bass | Piano | Guitar | Others | Music |
|---|---|---|---|---|---|---|
| (a) Percentage of χ′ among all evaluations including N/A | | | | | | |
| Overall | 91.49–93.62 | 83.91–86.77 | 89.02–91.43 | 89.00–91.41 | 86.48–89.13 | 87.82–91.15 |
| Timbre | 82.40–85.37 | 78.16–81.39 | 86.24–88.91 | 84.80–87.59 | 83.12–86.04 | 79.57–83.76 |
| Rhythm | 86.94–89.55 | 72.87–76.37 | 79.64–82.79 | 80.71–83.80 | 76.53–79.87 | 81.73–85.73 |
| Melody | 57.89–61.85 | 73.33–76.81 | 82.37–85.34 | 82.04–85.04 | 79.05–82.24 | 81.96–85.94 |
| (b) Percentage of χ′ among evaluations excluding N/A | | | | | | |
| Timbre | 92.17–94.32 | 86.80–89.55 | 91.63–93.80 | 89.53–91.94 | 88.60–91.12 | 86.57–90.19 |
| Rhythm | 89.90–92.25 | 79.61–82.90 | 86.69–89.42 | 86.17–88.93 | 84.24–87.21 | 87.38–90.87 |
| Melody | 90.63–93.37 | 83.13–86.22 | 88.70–91.22 | 88.14–90.72 | 85.05–87.91 | 89.30–92.55 |
5. Conclusion
In this study, we investigated how music similarity is perceived when focusing on individual instrumental parts and how it relates to track-level similarity. In our previous work (Hashizume and Toda, 2025), we conducted a large-scale listening test and obtained 26,898 responses from 632 participants from four perspectives, that is, timbre, rhythm, melody and overall. We conducted an extended analysis and obtained the following insights:
The relative perceptual similarity among the three samples at the part-level sometimes corresponds to the similarity at the track (mixed sounds) level and sometimes does not. Which instrumental part is dominant varies across music triplets and listeners, where parts whose similarity agrees with the track-level similarity can be considered dominant in the perception of track-level similarity. Moreover, the variance of evaluations across participants for the same music triplet is smaller than the variance of evaluations within the same participant across different music triplets, suggesting that differences among music triplets have a stronger influence than individual differences among participants. This indicates that certain instrumental parts tend to be more salient for listeners’ perception depending on the music track.
Although rhythm and melody tend to have a larger impact than timbre on perceptual music similarity for each instrumental part, deep learning models with temporal average pooling capture timbre-related features better than rhythm-related ones. This suggests that temporal average pooling leads to the loss of rhythmic information.
The similarity between temporally distinct segments within the same music track is perceived to be significantly higher than that between segments from different tracks, supporting the unsupervised learning approach used in our previous work (Hashizume et al., 2025, 2022).
While our experiments were conducted on Slakh2100, a MIDI-rendered data set with a limited set of instrumental categories, conducting the same experiments with real acoustic instruments and vocals could yield more generalizable insights and remains an important direction for future work. As further future work, feature extraction that explicitly accounts for the instrumental parts listeners tend to focus on in each track could be leveraged to improve the general accuracy of track-level similarity estimation. In addition, developing feature extraction models with temporal pooling methods that account for rhythm could improve the performance of part-level perceptual music similarity estimation.
Appendix 1. Pseudo musical piece
A triplet sampling method suitable for our method (Hashizume et al., 2025) should satisfy the following requirement: when learning the subspace corresponding to an instrument, the anchor and a positive sample must be similar, and the anchor and a negative sample dissimilar, with respect to that instrument. For example, when learning the guitar subspace, triplets of music tracks selected on the basis of similarity focusing on the guitar part should be input. To satisfy this requirement without manual labels, we introduced an unsupervised learning method adapted from our previous work (Hashizume et al., 2022). We proposed the use of pseudo musical pieces, which are created by mixing instrumental parts from different music tracks. By combining the guitar part from music track A with the non-guitar parts from music tracks B and C, respectively, we create two pseudo musical pieces. By temporally dividing these tracks, we construct a pair of segments that share temporally distinct segments of the same guitar part (both from A), whereas the other instrumental parts originate from different tracks (B and C). These serve as the anchor and positive sample. We also create a pseudo musical piece containing the guitar part from a music track other than A as the negative sample. We showed that this method improves the model’s performance. A sketch of this construction is given below.
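A sketch of the pseudo-musical-piece construction under the same per-part stem representation assumed earlier; the dummy tracks and the helper name are hypothetical stand-ins for Slakh2100 data:

```python
import numpy as np

PARTS = ["drums", "bass", "piano", "guitar", "others"]
rng = np.random.default_rng(0)

def dummy_track(n: int = 44100 * 30) -> dict:
    """Stand-in for a Slakh track: random noise per part (illustrative only)."""
    return {p: rng.standard_normal(n) for p in PARTS}

def pseudo_piece(target_stem: np.ndarray, other_track: dict, part: str) -> np.ndarray:
    """Mix one part's stem with the remaining parts of a different track."""
    mix = target_stem.copy()
    for p in PARTS:
        if p != part:
            mix += other_track[p][: len(mix)]
    return mix

track_a, track_b, track_c, track_d, track_e = (dummy_track() for _ in range(5))

# Anchor and positive share the same guitar part (track A) but different
# accompaniments (tracks B and C); temporally distinct segments are then
# cut from each mix. The negative uses a guitar part from another track (D).
anchor_mix = pseudo_piece(track_a["guitar"], track_b, "guitar")
positive_mix = pseudo_piece(track_a["guitar"], track_c, "guitar")
negative_mix = pseudo_piece(track_d["guitar"], track_e, "guitar")
```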
Appendix 2. Consideration for the melody of drums
The number of evaluations for the melody of drums after filtering is significantly smaller than for the other perspectives because many participants selected N/A for the melody of drums, as illustrated in Figure 4. Among the 281 unique participants, 39 selected N/A for all evaluations of the melody of drums. In other words, 242 participants provided at least one non-N/A evaluation for drum melody across the sample sets. Furthermore, there were no sample sets for which all participants responded with N/A. Even so, the number of sample sets in which at least 80% of participants agreed on a non-N/A evaluation was only four in χαβ (shown in Table 3). In other words, there were only a few sample sets in which participants exhibited higher sensitivity to differences in melody. In these sample sets, one of the two samples (A or B) had a drum pattern highly similar to X, with easily distinguishable pitch contrasts such as repeated alternations between bass and snare drums, whereas the other had a drum pattern different from X. We consider that these differences were perceived as melodic differences.