This study investigates perceptual similarity at two levels: music tracks (track-level) and the individual instrumental parts that compose them (part-level). A previous work studied perceptual part-level similarity toward developing a model that estimates it, conducting an ABX-style listening test with 632 participants that evaluated similarity at both levels from the perspectives of timbre, rhythm, melody and overall. Although that work yielded some findings from the evaluations, further insights are needed to support the development of future estimation models. Specifically, important questions remain regarding the correspondence between track- and part-level similarity, the generalizability of findings across multiple models, and the validity of the conventional learning method in terms of perceptual similarity. This study revealed the following key findings: (1) the instrumental parts that predominantly affect track-level similarity differ across music triplets and listeners, with the influence of differences across music triplets exceeding that of differences across listeners, indicating that part-level similarity helps in estimating track-level similarity; (2) when temporal average pooling is applied, the outputs of deep learning models correspond more closely with perceptual evaluations based on timbre than with those based on rhythm, indicating a potential area for improvement in the models; (3) the similarity between temporally distinct segments within the same music track is perceived to be significantly higher than that between segments from different tracks, which supports the assumption of the conventional unsupervised learning method developed for music similarity estimation.
1. Introduction
Measuring music similarity is an important research topic owing to its critical role in music retrieval and recommendation systems. Such systems require similarity measures that reflect listeners’ perceptual similarity (Berenzweig et al., 2003; Ellis et al., 2002). However, manually labeling similarity across huge music collections is impractical. Therefore, many techniques for automatically estimating music similarity have been presented, including approaches based on Gaussian mixture models (Aucouturier and Pachet, 2002; Pampalk, 2006), string-based representations (Casey and Slaney, 2006), sampling with collaborative filtering (McFee et al., 2012), binary code representations (Schlüter, 2013), transfer learning (Hamel et al., 2013), metric learning (Wolff, 2014), path-based methods (Gabbolini and Bridge, 2021) and perception-related acoustic features identified from user surveys (Cheng et al., 2020), as well as analysis of the reliability of similarity evaluation (Urbano et al., 2011).
We proposed deep metric learning methods for estimating music similarity focusing on each instrumental part (part-level similarity) (Hashizume et al., 2025, 2022), in contrast to previous studies that targeted the similarity of entire music tracks (track-level similarity). Our previous works aimed to achieve flexible music retrieval and recommendation that allow users to choose which aspect of the music to focus on. We used deep learning, which has become a standard approach for automatic music feature estimation and has shown strong performance in recent years. A variety of data-driven methods for track-level music feature estimation have been developed using available labels, including genre (Choi et al., 2019; Elbir and Aydin, 2020; Fathollahi and Razzazi, 2021; Hamel and Eck, 2010; Li et al., 2010), artist information (Cleveland et al., 2020; Lee et al., 2019; Park et al., 2018) and human annotation (Lu et al., 2017). Moreover, recent studies have shown that deep metric learning is effective for learning meaningful music representations (Lee et al., 2020, 2019; Lu et al., 2017; Park et al., 2018; Prétet et al., 2020). However, unlike these studies, our problem setting lacks labels suitable for training, that is, part-level similarity labels. To address this issue, we used an unsupervised learning method with a loss function that reduces the distance between temporally distinct segments of an instrumental part within the same music track; this approach benefits from requiring no labeled data. One limitation of this method is that the correspondence of the learned similarity measure to perceptual similarity is unclear, as perceptual music similarity is not used as an explicit target during training.
To investigate the correspondence between the estimated music similarity and perceptual music similarity, subjective evaluations obtained from a listening test are required. For many years, work has been done on perceptual music similarity through listening tests. Studies that clarify the factors listeners prioritize when perceiving music (Eerola et al., 2001; Mcadams and Matzkin, 2001; Novello et al., 2011) contribute to the design of methods that automatically estimate similarity based on listeners’ perception. Addressing how to aggregate responses from multiple participants (Eerola et al., 2001; Mcadams and Matzkin, 2001; Novello et al., 2011) is essential for constructing reliable ground truth labels for model evaluation. However, no listening tests have been conducted in which participants evaluate the part-level music similarity.
To address this limitation, we conducted an ABX-style listening test for the part-level similarity with 632 participants and collected 26,898 responses in our previous work (Hashizume and Toda, 2025). The perceptual similarity was evaluated from four perspectives: timbre, rhythm, melody and overall, with overall denoting a comprehensive assessment integrating the other three. We analyzed the results of the listening test and found the following:
The relative perceptual music similarity among three tracks (i.e. the similarity comparison between an X-A pair and an X-B pair) varies depending on the instrumental part that listeners focus on.
Rhythm and melody tend to have a larger impact on perceptual music similarity for each instrumental part than timbre, except for the melody of drums.
Our previously developed model (Hashizume et al., 2025) mainly captures timbre-related similarity. Furthermore, the responses collected in the ABX test were released as the dataset [1].
Although our previous work provided several findings, further insights are needed to support the development of future estimation models. While we showed that the relative similarity between music tracks depends on the instrumental part in focus, how overall track-level similarity is determined remains unexplored. In addition, while our previous work evaluated our own model, it did not include evaluations across multiple models, leaving more general conclusions open. Furthermore, although we collected evaluations under the same sample selection scheme as our unsupervised learning framework, those evaluations had not yet been analyzed.
In this study, we aim to investigate part-level similarity from the following aspects through our analysis:
which instrumental parts predominantly affect track-level perceptual music similarity;
the correspondence between the output of the deep learning models and perceptual similarity in terms of timbre, rhythm and melody; and
the validity of the assumption underlying conventional unsupervised learning methods for music similarity estimation, namely, that the similarity between temporally distinct segments within the same track is higher than that between segments from different tracks.
2. Related works
One of the objectives of our study is to clarify the perception of part-level similarity to support the development of part-level similarity estimation models. Several studies have previously analyzed the perception of music similarity and contributed to the estimation of track-level perceptual similarity. They have clarified which factors influence similarity perception, for example, the impact of rhythm compared to pitch (Mcadams and Matzkin, 2001), the role of frequency-based descriptors (Eerola et al., 2001) and the effects of genre, tempo and timbre (Novello et al., 2011).
Another objective of our study is to create an evaluation data set for part-level similarity estimation models based on subjective evaluations obtained from multiple participants. Similar efforts have been made in several studies for track-level similarity. Several studies have constructed ground truth data sets for music similarity, using both nonexpert subjective evaluations and expert assessments. For example, Ellis et al. (2002) and Berenzweig et al. (2003) collected large-scale evaluations through a listening test to estimate artist similarities. Typke et al. (2005) and Novello et al. (2011) built ground truth data sets for popular music similarity, revealing high consistency in perceptual evaluations across listeners. Müllensiefen and Frieler (2007) and Volk and van Kranenburg (2012) further provided annotated data sets focusing on melodic similarity.
The above studies focused on track-level perceptual similarity. Research closer to our part-level similarity work has conducted listening tests using individual instrumental sounds as stimuli (Lakatos, 2000; McAdams et al., 1995). These studies mainly conducted listening tests focusing on musical timbre and showed that perceptual similarity evaluations are consistent among listeners and relate to acoustic features of the instruments.
The studies mentioned above have all provided important insights into music perception and are closely related to our work. However, these studies do not cover the analysis or data set construction of part-level perceptual similarity together with the track-level similarity, using audio of individual instrumental parts and that of the original music tracks composed of them.
3. Listening test for perceptual music similarity evaluation
In this section, we describe the listening test conducted in our previous work (Hashizume and Toda, 2025). The listening test was conducted on the web page shown in Figure 1 with participants recruited through crowdsourcing. To investigate part-level perceptual music similarities, we conducted an ABX-style listening test on the similarity of sample sets consisting of instrumental parts and original music tracks. Participants provide responses for each sample set, which refers to a set of three audio samples labeled A, B and X. A collection of sample sets that participants respond to in a single listening test session is called an evaluation set. For each sample set, participants provide responses from four perspectives: timbre, rhythm, melody and overall. These perspectives were determined based on previous findings indicating that music similarity is influenced by genre, tempo and timbre (Novello et al., 2011), by rhythm and pitch (Mcadams and Matzkin, 2001) and by melodic (Eerola et al., 2001; Volk and van Kranenburg, 2012) and timbral (Lakatos, 2000; McAdams et al., 1995) characteristics. Each sample set consisted of either three different samples from the same instrument category (e.g. all three are drum sounds) or three different music tracks. One evaluation set included instrumental parts from five different categories (drums, bass, piano, guitar, others) and original music tracks.
Figure 1. Image of the web page used in the listening test. Each sample set has play buttons for three samples and response buttons for four perspectives: timbre, rhythm, melody and the total similarity (i.e. the overall similarity across these three perspectives; hereinafter referred to as “overall” in the paper). In this figure, “Reference” corresponds to X as used in this paper. The actual screen shows the instructions for the procedure of the listening test above this. In addition to the content described in Section 3.1, the instructions state that participants are required to listen to each sample from beginning to end. Furthermore, the “Next” button remains disabled until all samples have been played and responses satisfying the constraints described in Section 3.1 have been selected.
3.1 Evaluation procedure for one sample set
As shown in Figure 1, participants were presented with three audio samples, X, A and B, and listened to all of them. They chose A+ if they perceived A to be strongly more similar to X than B, and A− if only slightly more similar; likewise, they chose B+ or B− when they perceived B to be more similar to X than A. Responses were given from four perspectives: timbre, rhythm, melody and overall.
In each response, participants were allowed to select N/A for up to two perspectives, excluding overall. The instructions described the cases in which N/A could be selected: “A and B are similar/dissimilar to X to an equal degree” or “The presented instrumental track has no element corresponding to the perspective, e.g. drums have no melody.”
3.2 Sample selection
An overview of sample selection is shown in Figure 2. We created two types of sample set: “χαβ” and “χχ′γ”. For χαβ, we randomly selected three different music tracks, χ_i, α_i and β_i, from the test set of the Slakh2100 data set. This combination is referred to as a music triplet. We randomly captured five-second segments from each instrumental part contained in the three tracks of that music triplet. Five-second segments were also randomly extracted from each original music track. Then, we obtained one sample set for each instrument or music track, {X, A, B} = {χ_ij, α_ij, β_ij}. The subscript j represents each category of instrument or the music track (j = 0, …, 5), where 0 represents drums, 1 bass, 2 piano, 3 guitar, 4 others, and 5 the original music track. This selection was repeated four times (i = 0, …, 3), and 24 sample sets were created (6 categories × 4 repetitions).
Figure 2. How to create one evaluation set. Examples of how one sample set is created for the piano part (j = 2) are shown with red lines for χαβ and χχ′γ. Sample sets are created in the same manner for the other instrumental parts and the original music track, and the procedure is repeated four times (i = 0, …, 3).
For χχ′γ, we used the same X (= χ_ij) as in χαβ, and one of the other two samples was taken from a temporally distinct segment of the same music track as X. Namely, we randomly selected a music track γ_i other than χ_i and replaced α_ij and β_ij with χ′_ij and γ_ij, respectively, and the process was repeated in the same manner as for χαβ. Another 24 sample sets were thus created.
Then, the 48 sample sets constituted one evaluation set. This procedure was repeated with random sample selection, resulting in 60 evaluation sets. Hereafter, time indices are omitted for simplicity. To distinguish the segment taken from the same track as χ after dropping the indices, we denote it as χ′. The following sketch summarizes this procedure.
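The construction above can be summarized in a minimal sketch, assuming each track is available as a dict mapping part names to waveform arrays; names such as `segment` and `make_evaluation_set` are hypothetical, and the check that χ′ does not temporally overlap χ is omitted for brevity.

```python
import random

import numpy as np

SR = 44100  # assumed sample rate of the stems
CATEGORIES = ["drums", "bass", "piano", "guitar", "others", "music"]  # j = 0..5

def segment(track: dict, category: str) -> np.ndarray:
    """Cut a random five-second segment from one part (or the full mix) of a track."""
    audio = track[category]
    t = random.randint(0, len(audio) - 5 * SR)
    return audio[t : t + 5 * SR]

def make_evaluation_set(tracks: list) -> list:
    """Build one evaluation set: 24 chi-alpha-beta plus 24 chi-chi'-gamma sample sets."""
    sample_sets = []
    for i in range(4):  # four repetitions (i = 0..3)
        chi_t, alpha_t, beta_t = random.sample(tracks, 3)  # music triplet
        gamma_t = random.choice([t for t in tracks if t is not chi_t])
        for cat in CATEGORIES:
            chi = segment(chi_t, cat)
            # chi-alpha-beta: X, A and B come from three different tracks.
            sample_sets.append({"X": chi, "A": segment(alpha_t, cat), "B": segment(beta_t, cat)})
            # chi-chi'-gamma: chi' is another segment of the same track as X;
            # the {chi', gamma} ordering is randomized before assignment to {A, B}.
            pair = [segment(chi_t, cat), segment(gamma_t, cat)]
            random.shuffle(pair)
            sample_sets.append({"X": chi, "A": pair[0], "B": pair[1]})
    return sample_sets
```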
3.3 Setup of the listening test
The presented audio samples were 136 music tracks and the instrumental parts that compose them. They were the tracks remaining from the 151 tracks in the test set of the redux subset of the Slakh2100 data set (Manilow et al., 2019) after excluding tracks that do not contain enough instrumental parts. Fifty sample sets, consisting of the 48 sample sets created as explained in Section 3.2 and two dummy sample sets, were shuffled and presented to each participant individually. The participants were not informed whether the presented sample set was χαβ, χχ′γ or a dummy. The dummy sample sets were of two types: one in which A is exactly the same audio as X, and one in which B is exactly the same audio as X. In χχ′γ, one of the two orderings, {χ′, γ} or {γ, χ′}, was randomly assigned to {A, B}. Since this study aims at a system for general users rather than experts, the listening tests were conducted by recruiting participants through a crowdsourcing platform (CrowdWorks, 2025). Participation was fully voluntary and anonymous, and no personally identifiable information was collected. Participants were free to withdraw from the experiment at any time. The average duration of a single session, in which one evaluation set was assessed, was approximately 30 min, and participants were compensated according to the standard rates of the platform.
3.4 Response aggregation
The following responses were excluded as inappropriate for analysis: all responses from participants who did not select the same sample as X in the dummy tests; responses that took less than 15 s including listening time; responses left blank owing to technical problems; and duplicate responses from the same participant within each evaluation set. After these exclusions, counting the set of evaluations on timbre, rhythm, melody and overall as a single response, we obtained 26,898 valid responses from a total of 632 participants (281 unique participants). For all 60 evaluation sets, valid responses were obtained from at least six different participants. Note that this count also includes responses given only for instrumental parts. This is because the listening test was conducted in two phases: in the first phase, only evaluations of individual instrumental parts were collected to obtain instrument labels, namely, responses to 42 sample sets (5 categories × 4 repetitions × 2 types, plus 2 dummy sets); in the second phase, participants responded to complete evaluation sets, which consisted of 50 sample sets including the original music tracks. The total number of valid participants who responded to the complete evaluation sets was 346 (167 unique participants); the total number of valid responses to the complete evaluation sets was 16,317. Figure 3 shows a histogram of the number of participants providing valid responses and the number of valid responses to each evaluation set.
Figure 3. Number of participants providing valid responses and number of valid responses to each evaluation set. (a) Number of all participants providing valid responses, (b) number of all valid responses, (c) number of participants who provided valid responses to the complete evaluation sets, (d) number of valid responses to the complete evaluation sets.
In addition, the number of strong responses (A+ or B+), the number of weak responses (A− or B−), and the number of N/A responses are shown in Figure 4. In the case of drums, there were many N/A responses for melody. This is assumed to be because many participants evaluated that “drums have no melody,” which was given as an example in the instructions.
Figure 4. Heatmaps showing the number of strong responses (A+ or B+), the number of weak responses (A− or B−) and the number of N/A responses for each instrument and each perspective, for χαβ and χχ′γ. Note that only the responses to the complete evaluation sets are shown here to enable comparison between the original music tracks and each instrumental part.
4. Analysis
In this section, we examine three questions through analysis:
Which instrumental parts are dominant in determining the track-level music similarity?
Is there a correspondence between the similarity criteria output by the deep learning models and perceptual music similarity in terms of timbre, rhythm and melody?
Is the similarity between temporally distinct segments within the same music track perceived as higher than that between segments from different tracks?
In this paper, unless otherwise noted, we define “A+” and “A−” as the same response, and “B+” and “B−” as the same response; in other words, the response is treated as A, B or N/A. (For overall, the response is either A or B).
4.1 Impact of each instrumental part on track-level similarity
In this section, we examine which instrumental parts predominantly contribute to the determination of track-level music similarity, as illustrated in Figure 5 (here, we focus only on χαβ). As shown in our previous work (Hashizume and Toda, 2025), the perceived similarity between music tracks varies depending on the instrumental part being focused on (Figure 6). This result raises the question of which instrumental parts listeners prioritize when determining track-level similarity. For example, as in Figure 5, if a participant perceives that music track A is more similar to X for the bass, piano and others parts, but music track B is more similar to X for the drums and guitar, and evaluates that music track B is more similar to X for the original music track (i.e. mixed sounds), then we can consider that the drums and guitar are dominant in that music triplet for that participant.
Figure 5. Highlighting of the pair that a participant perceives as more similar when listening to the instrumental parts that compose the same music triplet. In this example, the drums and guitar are considered dominant because their evaluations are consistent with the track-level similarity.
Figure 6. Our previous work (Hashizume and Toda, 2025) showed that the relative perceptual music similarity among three tracks (i.e. the similarity comparison between an X-A pair and an X-B pair) varies depending on the instrumental part being listened to. In the example shown in Figure 5, the drums and guitar are matching, whereas the drums and piano are mismatching. This figure summarizes all responses from all participants across all music triplets and shows the matching rate for each pair of instruments (in χαβ).
We calculated the matching rates between the track-level evaluation and the evaluation of each instrumental part, and Figure 7 shows the results for each music triplet and for each participant, along with dendrograms obtained using Ward’s hierarchical clustering (Ward, 1963). In Figure 7(a), light colors indicate that participants were divided in their evaluation of the dominant part. Dark colors represent a high level of agreement among participants: red indicates that the participants perceived the part as dominant, whereas blue indicates that they perceived it as not dominant. In Figure 7(b), light colors indicate that the participant’s evaluation changes depending on the music triplet. Dark colors indicate a high level of consistency in a participant’s evaluations across different music triplets: red indicates that the participant consistently perceived the part as dominant, whereas blue indicates that the part was consistently perceived as not dominant. It can be seen that the dominant instrumental parts vary across both music triplets and participants. Clusters corresponding to patterns associated with particular combinations of instruments were formed. For example, in (a), when the dendrogram is cut to yield four clusters, the bottom cluster represents a group of music triplets in which the others part is consistently disregarded across participants. Furthermore, this cluster is divided at the next branch into two subclusters: one in which the bass part is attended to, and another in which it is disregarded. Figure 8 presents a two-dimensional plot obtained by applying t-SNE for dimensionality reduction, with colors indicating the clusters identified using Ward’s method; a sketch of this clustering procedure is shown below. These results suggest that the instrumental parts that listeners focus on can be grouped according to either the music triplet or the listener.
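As a sketch, the clustering and embedding shown in Figures 7 and 8 can be reproduced from the matching-rate vectors with SciPy and scikit-learn; the input file name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from sklearn.manifold import TSNE

# Each row is a five-dimensional vector of matching rates between the
# track-level evaluation and each part-level evaluation (drums, bass,
# piano, guitar, others), per music triplet or per participant.
rates = np.load("matching_rates.npy")  # hypothetical precomputed file

link = linkage(rates, method="ward")                   # Ward's hierarchical clustering
row_order = leaves_list(link)                          # row ordering for the heatmap
clusters = fcluster(link, t=32, criterion="maxclust")  # the 32 clusters of Figure 8

# Two-dimensional embedding of the same vectors (Figure 8).
xy = TSNE(n_components=2, random_state=0).fit_transform(rates)
```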
Figure 7. The heatmaps show the matching rates between track-level evaluations and the evaluations for each instrumental part, represented as a five-dimensional vector. (a) The matching rates calculated across all participants who responded to one music triplet, and (b) the matching rates calculated across all music triplets responded to by one participant. The vectors are reordered based on hierarchical clustering using Ward’s method (Ward, 1963), with dendrograms displayed on the left and top sides. Horizontally, clustering is performed by instrumental parts. Vertically, clustering is performed by music triplets in (a) and by participants in (b).
Figure 8. Two-dimensional plots of the five-dimensional vectors shown in Figure 7, reduced using t-SNE. The colors indicate the 32 clusters obtained by cutting the dendrogram from Ward’s hierarchical clustering, as presented in Figure 7.
In addition, comparing the two heatmaps in Figure 7, (a) based on music triplets and (b) based on participants, we can see that (a) contains more dark-colored cells. This suggests that evaluations of the same music triplet show high agreement across participants in terms of whether an instrumental part is perceived as dominant. To quantify this, we tested whether there is a significant difference between the variability of evaluations across different participants for the same music triplet and the variability of evaluations across different music triplets from the same participant. Specifically, for each instrumental part, we first constructed data in which evaluations were coded as 1 if the track- and part-level evaluations matched, and 0 if they did not. Then, we examined whether the mean of the variances of the data within each music triplet was smaller than the mean of the variances of the data within each participant, using the Mann–Whitney U test (Mann and Whitney, 1947); a sketch of this test is given after this paragraph. The results are summarized in Table 1. For all instruments, the within-triplet variance was lower than the within-participant variance on average. Although the effect sizes were small, the difference was statistically significant (p < 0.05) for all instruments except the piano. This indicates a consistent tendency across the data rather than a large separation between the two distributions. In other words, the tendency for different listeners to evaluate a given music triplet similarly is stronger than the tendency for the same listener to consistently focus on a particular instrument. This result implies that which instrumental parts are perceived as dominant is more likely to depend on the music triplet than on the individual. Taken together, these results show that certain instrumental parts have a strong influence on music similarity and that, despite individual differences, there is a considerable level of agreement across listeners. Therefore, estimating which instruments are dominant and leveraging this information in track-level similarity estimation is expected to improve the accuracy of general track-level similarity measures.
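A sketch of this test for one instrument, assuming the per-triplet and per-participant variances of the 0/1 match indicators are precomputed, and assuming the effect size is the standard r = |Z|/√N obtained from the normal approximation of U:

```python
from scipy.stats import mannwhitneyu

def compare_variances(triplet_vars, participant_vars):
    """One-sided test: are within-triplet variances smaller than within-participant ones?"""
    u, p = mannwhitneyu(triplet_vars, participant_vars, alternative="less")
    n, m = len(triplet_vars), len(participant_vars)
    # Normal approximation of U under the null hypothesis, then r = |Z| / sqrt(N).
    z = (u - n * m / 2) / (n * m * (n + m + 1) / 12) ** 0.5
    return p, abs(z) / (n + m) ** 0.5
```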
Table 1. The mean of the variances of the data within each participant (averaged across participants) and the mean of the variances of the data within each music triplet (averaged across music triplets), where the data represent binary values (1 or 0) indicating whether the evaluations matched the track-level evaluations
| Statistic | Drums | Bass | Piano | Guitar | Others |
|---|---|---|---|---|---|
| Mean variance in a triplet | 0.21496 | 0.22861 | 0.22288 | 0.22473 | 0.22137 |
| Mean variance in a participant | 0.25918 | 0.24173 | 0.23684 | 0.25223 | 0.24672 |
| p-value | 0.00000 | 0.00906 | 0.09272 | 0.00001 | 0.00170 |
| Effect size | 0.252 | 0.125 | 0.070 | 0.225 | 0.154 |
The p-values for testing whether the within-triplet variance is smaller than the within-participant variance based on the Mann–Whitney U test, as well as the absolute effect sizes (|r|), are also shown. Since the number of evaluations per music triplet and per participant differed in range, we filtered the data to include only music triplets and participants with between 4 and 9 evaluations, where the ranges overlapped. After this filtering, the average number of evaluations per music triplet was 5.74, and the average number of evaluations per participant was 4.68. The means were calculated after this filtering.
4.2 Performance evaluation of the deep learning models
In our previous work (Hashizume and Toda, 2025), we assessed the performance of our model (Hashizume et al., 2025) using evaluations obtained in the listening test. We showed that our model (Hashizume et al., 2025) mainly captures timbral features. Our work also showed that rhythm and melody are prioritized over timbre in perception (Table 2), highlighting a gap that points to potential areas for improving our method. However, evaluating the performance of a single model provided only limited insight. Therefore, in this study, we conducted the same evaluation on the large-scale self-supervised learning model, MERT (Li et al., 2024), to gain further insight.
Table 2. The matching rate with overall for each perspective [%] [shown in our previous work (Hashizume and Toda, 2025)]
| Perspectives | Drums | Bass | Piano | Guitar | Others | Music |
|---|---|---|---|---|---|---|
| Timbre | 74.7 | 74.5 | 75.6 | 75.1 | 75.3 | 74.3 |
| Rhythm | 84.4 | 80.1 | 78.4 | 80.6 | 79.8 | 82.0 |
| Melody | 56.7 | 79.0 | 80.7 | 81.0 | 83.1 | 82.8 |
For a given sample set, the perspectives for which a participant’s evaluations matched the overall evaluation were considered to be given priority, and thus, perspectives with higher matching rates were interpreted as more important in perception. Note that N/A is included in counting and treated as a mismatch with both A and B
4.2.1 Models to be evaluated.
4.2.1.1 Our previously developed model.
We developed an instrumental-part-level similarity learning method with a single network that takes original music tracks (i.e. mixed sounds) as input (Hashizume et al., 2025). We designed a similarity embedding space with separated subspaces for each instrumental part using conditional similarity networks (Veit et al., 2017). To separate the embedding space, a mask is applied to all dimensions except those corresponding to the notion considered in the triplet loss calculation. For example, when learning the drum space, a binary mask that retains only the dimensions assigned to drums is applied, and the distance between the masked feature representations is used for metric learning with the triplet loss; a sketch is given below. When learning the drum space, it is necessary to sample data based on (dis)similarity focusing on the drum part. To address this, we introduced pseudo musical pieces, which extend unsupervised learning under the assumption that segments within the same track are similar (see Appendix 1 for details). The model input was a mel-spectrogram converted from the original music track signal, and the output was a 640-dimensional vector in which five instruments (drums, bass, piano, guitar and others) were each assigned a 128-dimensional subspace. The model was trained using the training data from the Slakh2100 data set (Manilow et al., 2019). The network consisted of ten convolutional layers, followed by temporal average pooling and a fully connected layer.
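A minimal sketch of the masked triplet loss described above, under the 640 = 5 × 128 dimensional layout; the Euclidean distance, the margin value and the function name are illustrative assumptions rather than the exact training configuration:

```python
import torch
import torch.nn.functional as F

N_PARTS, SUBDIM = 5, 128  # drums, bass, piano, guitar, others

def masked_triplet_loss(anchor, positive, negative, part, margin=0.2):
    """Triplet loss computed in the subspace assigned to one instrumental part.

    anchor, positive, negative: (batch, 640) embeddings from the network.
    part: index of the instrument whose subspace is being trained.
    """
    mask = torch.zeros(N_PARTS * SUBDIM, device=anchor.device)
    mask[part * SUBDIM : (part + 1) * SUBDIM] = 1.0  # keep only this part's dims
    d_pos = torch.norm((anchor - positive) * mask, dim=1)
    d_neg = torch.norm((anchor - negative) * mask, dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```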
4.2.1.2 MERT with average poolings.
We also conducted a performance evaluation using MERT (Li et al., 2024), a pretrained model based on self-supervised learning that can extract musical features. Unlike the model developed in our previous work (Hashizume et al., 2025), MERT does not have a mechanism for extracting the distinct features of each instrumental part from original music tracks, so we input the instrumental part signals instead. We downsampled the signals to 24 kHz and extracted 3D tensors with 25 layers × T time steps × 1024 feature dimensions from the intermediate layers of the pretrained model [2]. We then applied average pooling along the layer and time axes to obtain 1024-dimensional representations. Note that in this type of framework, where individual instrumental signals are used as inputs, it is generally necessary to use estimated signals obtained from source separation of the original music track, as in the previous work (Hashizume et al., 2022), because true individual instrumental signals are typically not available at inference. However, in this experiment, to eliminate the influence of separation performance and simplify the discussion, we used the ideal upper bound of separated signals, that is, the ground-truth stems, as inputs (see the sketch below).
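A sketch of this extraction, assuming the publicly released Hugging Face checkpoint m-a-p/MERT-v1-330M (whose hidden states form a 25-layer × T × 1024 tensor consistent with the description above); the checkpoint name is our assumption, and note [2] gives the model actually used:

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

CKPT = "m-a-p/MERT-v1-330M"  # assumed checkpoint
model = AutoModel.from_pretrained(CKPT, trust_remote_code=True).eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT, trust_remote_code=True)

def embed_stem(path: str) -> torch.Tensor:
    """Average-pooled 1024-dim representation of one instrumental stem."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)  # mixdown to mono
    wav = torchaudio.functional.resample(wav, sr, processor.sampling_rate)  # 24 kHz
    inputs = processor(wav, sampling_rate=processor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = torch.stack(out.hidden_states).squeeze(1)  # (25 layers, T frames, 1024 dims)
    return h.mean(dim=(0, 1))  # average over the layer and time axes
```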
4.2.2 Performance evaluation procedure.
We calculated whether A or B was closer to X using the models for the same sample sets as used in the listening test and then calculated the matching rate between the models’ results and the subjective evaluation results, as shown in Figure 9. The distances were calculated using the models as follows: the music tracks originally containing the instrumental parts used in the listening test were input into the model, the instrumental-part-level feature representations were extracted, and the distances between them were measured. Sample sets with low agreement among participants indicate that A and B are about equally (dis)similar to X, making them unsuitable for this performance evaluation method. Therefore, only evaluations with agreement equal to or higher than 80% among participants were used. This filtering focuses on reliable perceptual judgments by emphasizing high-consensus cases, at the cost of broader sample coverage. To avoid excessively reducing the number of evaluation samples, we set the consensus threshold to 80%, balancing reliability and coverage. Note that if N/A accounted for the largest percentage, that sample set was excluded from the evaluation set. A sketch of this procedure follows.
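The filtering and matching can be sketched as follows; the helper names are hypothetical, and the 80% threshold follows the text:

```python
from typing import Optional

import numpy as np

def consensus_label(responses: list) -> Optional[str]:
    """Return 'A' or 'B' if at least 80% of participants agree, else None.

    responses: one sample set's answers ('A', 'B' or 'N/A') for one perspective.
    Sample sets whose most frequent answer is N/A are excluded entirely.
    """
    votes, counts = np.unique(responses, return_counts=True)
    top = votes[counts.argmax()]
    if top == "N/A" or counts.max() / len(responses) < 0.8:
        return None
    return top

def model_matches(dist_xa: float, dist_xb: float, label: str) -> bool:
    """True if the model's nearer sample agrees with the participants' choice."""
    return ("A" if dist_xa < dist_xb else "B") == label
```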
Figure 9. How to evaluate the models.
The numbers of unique sample sets and of evaluations used for this performance evaluation are shown in Table 3. We can see that there is more agreement among participants in their evaluations in χχ′γ than in χαβ. Moreover, there are fewer sample sets for timbre, rhythm and melody than for overall because participants could not select N/A for overall but could for the others, and sample sets that received N/A from 80% or more of the participants were omitted from the evaluation.
Table 3. Number of unique sample sets and number of evaluations used for performance evaluation
| Perspectives | Drums | Bass | Piano | Guitar | Others |
|---|---|---|---|---|---|
| (a) Number of unique sample sets of χαβ | | | | | |
| Overall | 92 | 95 | 83 | 90 | 95 |
| Timbre | 49 | 44 | 37 | 42 | 43 |
| Rhythm | 56 | 47 | 26 | 33 | 38 |
| Melody | 4 | 41 | 21 | 25 | 47 |
| (b) Number of unique sample sets of χχ′γ | | | | | |
| Overall | 212 | 180 | 210 | 209 | 192 |
| Timbre | 212 | 192 | 214 | 206 | 197 |
| Rhythm | 201 | 164 | 192 | 192 | 185 |
| Melody | 211 | 177 | 203 | 200 | 188 |
| (c) Number of evaluations of χαβ | | | | | |
| Overall | 912 | 949 | 836 | 925 | 977 |
| Timbre | 463 | 420 | 330 | 412 | 414 |
| Rhythm | 517 | 433 | 233 | 317 | 355 |
| Melody | 32 | 363 | 186 | 238 | 437 |
| (d) Number of evaluations of χχ′γ | | | | | |
| Overall | 2143 | 1792 | 2082 | 2096 | 1926 |
| Timbre | 1926 | 1745 | 2044 | 1993 | 1892 |
| Rhythm | 1983 | 1555 | 1800 | 1819 | 1721 |
| Melody | 1391 | 1589 | 1905 | 1898 | 1786 |
Only sample sets with an agreement rate of 80% or higher among participants and evaluations for them were included
4.2.3 Results.
The performance evaluation results of our model are shown in Figure 10. Although accuracy is lower on χαβ than on χχ′γ, for drums, piano and guitar the accuracy obtained using evaluations focusing on timbre is higher than that obtained using evaluations focusing on overall. For guitar, the accuracy using melody-focused evaluations is also higher than that using overall. Note that although the drum melody also shows a relatively high value, this result is based on a significantly smaller number of samples than the other results. Furthermore, we consider that this evaluation reflects the perceived similarity of drum patterns (see Appendix 2). For all instruments, rhythm tends to show an equal or lower accuracy than the other perspectives. This suggests that the model captures frequency-related features but not rhythmic features. We attribute this to the network architecture, which contains temporal average pooling. The performance evaluation results of MERT with average poolings are shown in Figure 11. Note that, as mentioned in Section 4.2.1, since this performance was obtained by inputting ideal instrumental signals (stems), a direct comparison between this result and that of our previous work is not possible. On the other hand, similar trends are observed between the model in our previous work and MERT with temporal and channel-wise average pooling, namely that performance on rhythm is lower than on the other perspectives. These findings suggest that temporal average pooling in deep learning models leads to a loss of rhythm information in any instrumental part.
Figure 10. Graphs of accuracy and the 95% confidence intervals of the previously developed model (Hashizume et al., 2025) for each instrument and each perspective, for χαβ and χχ′γ. The 95% confidence intervals were calculated using the Clopper–Pearson method (Clopper and Pearson, 1934). Colors represent instruments, and symbols represent perspectives. This result was shown in our previous work (Hashizume and Toda, 2025).
Figure 11. Graphs of accuracy and the 95% confidence intervals (Clopper and Pearson, 1934) of MERT for each instrument and each perspective, for χαβ and χχ′γ. Colors represent instruments, and symbols represent perspectives.
In conjunction with the finding in our previous work that listeners placed more importance on rhythm or melody than on timbre (Table 2), we consider that if we can design a model that captures temporal structure so that rhythm is taken into account, it will be possible to obtain a music similarity measure that is also compatible with listeners’ overall perception. This could be explored in future work by incorporating attention-based temporal modeling or by learning representations through downstream rhythm-related tasks (e.g. beat tracking), among other approaches.
4.3 Similarity within and between music tracks
In this section, we examine whether the similarity between temporally distinct segments within the same music track is perceived to be higher than that between different music tracks. In our previous works (Hashizume et al., 2025, 2022), we used an unsupervised learning method with triplet loss under the assumption that the similarity between temporally distinct segments within the same music track is higher than that between different music tracks. As shown in Figure 12, participants compare two between-music similarities in χαβ, whereas they compare the within-music similarity with the between-music similarity in χχ′γ. The data selection setting in χχ′γ is the same as in our unsupervised learning method. Table 4 shows the rate at which the segment from the same music track as X, namely χ′, was selected in χχ′γ. The 95% confidence intervals were calculated using the Clopper–Pearson method (Clopper and Pearson, 1934); a sketch of this computation follows. Considering all evaluations including N/A, the percentage exceeds 70% for all cases except the melody of drums. When N/A evaluations are excluded, the percentage exceeds 80% for all cases, including the melody of drums. These results show that the similarity between randomly extracted segments within the same music track is perceived to be significantly higher than that between different music tracks. This suggests that our previous method (Hashizume et al., 2025, 2022) is appropriate for realizing the goal of learning similarity that aligns with human perception. Furthermore, we can see from Figure 4 that the number of participants who perceived a strong similarity is higher for χχ′γ than for χαβ. Together, these results indicate that within-music similarity is perceived to be significantly higher than between-music similarity.
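A sketch of the interval computation, using the exact beta-quantile form of the Clopper–Pearson bound; the counts in the usage line are illustrative, not values from Table 4:

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return 100 * lo, 100 * hi  # in percent, as in Table 4

print(clopper_pearson(1700, 1850))  # illustrative counts; prints the interval in percent
```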
Figure 12. Explanation of between-music similarity and within-music similarity. In χχ′γ, one sample is a different segment of the same music track as X. Thus, one of the similarities to evaluate is the within-music similarity.
Table 4. Percentage of evaluations selecting the segment from the same music track (χ′) in χχ′γ, with the 95% confidence intervals [%]
| Perspectives | Drums | Bass | Piano | Guitar | Others | Music |
|---|---|---|---|---|---|---|
| (a) Percentage of χ′ among all evaluations including N/A | | | | | | |
| Overall | 91.49–93.62 | 83.91–86.77 | 89.02–91.43 | 89.00–91.41 | 86.48–89.13 | 87.82–91.15 |
| Timbre | 82.40–85.37 | 78.16–81.39 | 86.24–88.91 | 84.80–87.59 | 83.12–86.04 | 79.57–83.76 |
| Rhythm | 86.94–89.55 | 72.87–76.37 | 79.64–82.79 | 80.71–83.80 | 76.53–79.87 | 81.73–85.73 |
| Melody | 57.89–61.85 | 73.33–76.81 | 82.37–85.34 | 82.04–85.04 | 79.05–82.24 | 81.96–85.94 |
| (b) Percentage of χ′ among evaluations excluding N/A | | | | | | |
| Timbre | 92.17–94.32 | 86.80–89.55 | 91.63–93.80 | 89.53–91.94 | 88.60–91.12 | 86.57–90.19 |
| Rhythm | 89.90–92.25 | 79.61–82.90 | 86.69–89.42 | 86.17–88.93 | 84.24–87.21 | 87.38–90.87 |
| Melody | 90.63–93.37 | 83.13–86.22 | 88.70–91.22 | 88.14–90.72 | 85.05–87.91 | 89.30–92.55 |
5. Conclusion
In this study, we investigated how music similarity is perceived when focusing on individual instrumental parts and how it relates to track-level similarity. In our previous work (Hashizume and Toda, 2025), we conducted a large-scale listening test and obtained 26,898 responses from 632 participants from four perspectives, that is, timbre, rhythm, melody and overall. We conducted an extended analysis and obtained the following insights:
The relative perceptual similarity among the three samples at the part-level sometimes corresponds to the similarity at the track (mixed sounds) level and sometimes does not. Which instrumental part is dominant varies across music triplets and listeners, where parts whose similarity agrees with the track-level similarity can be considered dominant in the perception of track-level similarity. Moreover, the variance of evaluations across participants for the same music triplet is smaller than the variance of evaluations within the same participant across different music triplets, suggesting that differences among music triplets have a stronger influence than individual differences among participants. This indicates that certain instrumental parts tend to be more salient for listeners’ perception depending on the music track.
Although rhythm and melody tend to have a larger impact than timbre on perceptual music similarity for each instrumental part, deep learning models with temporal average pooling capture timbre-related features better than rhythm-related ones. This suggests that temporal average pooling leads to the loss of rhythmic information.
The similarity between temporally distinct segments within the same music track is perceived to be significantly higher than that between segments from different tracks, supporting the unsupervised learning approach used in our previous work (Hashizume et al., 2025, 2022).
While our experiments were conducted on Slakh2100, a MIDI-rendered data set with a limited set of instrumental categories, conducting the same experiments with real acoustic instruments and vocals could yield more generalizable insights and remains an important direction for future work. As further future work, feature extraction that explicitly accounts for the instrumental parts listeners tend to focus on in each track could be leveraged to improve the general accuracy of track-level similarity estimation. In addition, developing feature extraction models with temporal pooling methods that account for rhythm could improve the performance of part-level perceptual music similarity estimation.
Appendix 1. Pseudo musical piece
A triplet sampling method suitable for our method (Hashizume et al., 2025) should satisfy the following requirement: when learning the subspace corresponding to an instrument, the anchor and a positive sample must be similar, and the anchor and a negative sample dissimilar, with respect to that instrument. For example, when learning the guitar subspace, triplets of music tracks selected on the basis of similarity focusing on the guitar part should be input. To satisfy this requirement without manual labels, we introduced an unsupervised learning method adapted from our previous work (Hashizume et al., 2022). We proposed the use of pseudo musical pieces, which are created by mixing instrumental parts from different music tracks. By combining the guitar part from music track A with the non-guitar parts from music tracks B and C, respectively, we create two pseudo musical pieces. By temporally dividing these tracks, we construct a pair of segments that share temporally distinct segments of the same guitar part (both from A), whereas the other instrumental parts originate from different tracks (B and C). These serve as the anchor and positive sample. We also create a pseudo musical piece containing the guitar part from a music track other than A as the negative sample. We showed that this method improves the model’s performance. A sketch of this construction is given below.
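A sketch of the pseudo-musical-piece construction under the same per-part stem representation assumed earlier; the dummy tracks and the helper name are hypothetical stand-ins for Slakh2100 data:

```python
import numpy as np

PARTS = ["drums", "bass", "piano", "guitar", "others"]
rng = np.random.default_rng(0)

def dummy_track(n: int = 44100 * 30) -> dict:
    """Stand-in for a Slakh track: random noise per part (illustrative only)."""
    return {p: rng.standard_normal(n) for p in PARTS}

def pseudo_piece(target_stem: np.ndarray, other_track: dict, part: str) -> np.ndarray:
    """Mix one part's stem with the remaining parts of a different track."""
    mix = target_stem.copy()
    for p in PARTS:
        if p != part:
            mix += other_track[p][: len(mix)]
    return mix

track_a, track_b, track_c, track_d, track_e = (dummy_track() for _ in range(5))

# Anchor and positive share the same guitar part (track A) but different
# accompaniments (tracks B and C); temporally distinct segments are then
# cut from each mix. The negative uses a guitar part from another track (D).
anchor_mix = pseudo_piece(track_a["guitar"], track_b, "guitar")
positive_mix = pseudo_piece(track_a["guitar"], track_c, "guitar")
negative_mix = pseudo_piece(track_d["guitar"], track_e, "guitar")
```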
Appendix 2. Consideration for the melody of drums
The number of evaluations for the melody of drums after filtering is significantly smaller than for the other perspectives because many participants selected N/A for the melody of drums, as illustrated in Figure 4. Among the 281 unique participants, 39 selected N/A for all evaluations of the melody of drums. In other words, 242 participants provided at least one non-N/A evaluation for drum melody across the sample sets. Furthermore, there were no sample sets for which all participants responded with N/A. Even so, the number of sample sets in which at least 80% of participants agreed on a non-N/A evaluation was only four in χαβ (shown in Table 3). In other words, there were only a few sample sets in which participants exhibited higher sensitivity to differences in melody. In these sample sets, one of the two samples (A or B) had a drum pattern highly similar to X, with easily distinguishable pitch contrasts such as repeated alternations between bass and snare drums, whereas the other had a drum pattern different from X. We consider that these differences were perceived as melodic differences.