### Further details about coupled mode theory and the scattering matrix

In the main text, we consider systems of coupled neuron modes, such as resonator modes. In general, the evolution equations for the complex field amplitudes *a*_{j}(*t*) of these modes evolving under linear physical processes are of the form^{30,31}

$${\dot{a}}_{j}(t)=-{{{\rm{i}}}}\mathop{\sum}\limits_{\ell }{H}_{j,\ell }{a}_{\ell }(t)-\sqrt{{\kappa }_{j}}{a}_{j,{{{\rm{probe}}}}}(t).$$

(9)

Here *H* is the dynamical matrix that encodes the coupling strengths between the modes, the decay rates and the detunings. For instance, its diagonal can encode the decay due to intrinsic losses \({\kappa }_{j}^{{\prime} }\), the coupling to probe waveguides *κ*_{j} and the detunings \({\varDelta }_{j}\), that is, \({H}_{j,\;j}=-{{{\rm{i}}}}\frac{{\kappa }_{j}+{\kappa }_{j}^{{\prime} }}{2}+{\varDelta }_{j}\), whereas its off-diagonal entries *H*_{j,ℓ} correspond to couplings between the modes. Probe waveguides are coupled to all or some of the modes, which allows the probe field *a*_{j,probe}(*t*) to be injected into the *j*th mode. For full generality, in the discussion around equation (1), we assumed that waveguides are attached to all the modes, although this is not a requirement for our scheme to work. In fact, in the next section, we consider a scenario in which we only probe the system at a few modes. In this case, we set *κ*_{j} = 0 in equations (1) and (9) for any modes that are not coupled to waveguides.

In a typical experiment, one records the system response to the coherent probe field *a*_{j,probe}(*t*) = *a*_{j,probe}(*ω*)e^{iωt} at a certain frequency *ω*. Therefore, it is convenient to express equation (9) in the frequency domain by performing a Fourier transform as \({a}_{j}(\omega )=\frac{1}{\sqrt{2\uppi }}\int\nolimits_{-\infty }^{\infty }{{{\rm{d}}}}t\,{\rm{e}}^{{{{\rm{i}}}}\omega t}{a}_{j}(t)\), which leads to the dynamical equation in the frequency domain (equation (1)).

We obtain the expression for the scattering matrix (equation (3)) by solving equation (1) for \({{{\bf{a}}}}(\omega )=-{{{\rm{i}}}}{(\omega {\mathbb{1}}-H({{{\bf{x}}}},\theta ))}^{-1}\sqrt{\kappa }{{{{\bf{a}}}}}_{{{{\rm{probe}}}}}(\omega )\) and inserting this expression into the input–output relations^{30,31} as \({a}_{j,{{{\rm{res}}}}}(\omega )\)\(={a}_{j,{{{\rm{probe}}}}}(\omega )+\sqrt{{\kappa }_{j}}{a}_{j}(\omega )\), which establishes a relation between the mode fields *a*_{j}(*ω*), the external probe fields *a*_{j,probe}(*ω*) and the response fields *a*_{j,res}(*ω*). By convention, both *a*_{j,probe}(*ω*) and *a*_{j,res}(*ω*) are in units of Hz^{1/2}. Solving for **a**_{res}(*ω*), we obtain

$$\begin{array}{rcl}{{{{\bf{a}}}}}_{{{{\rm{res}}}}}(\omega )&=&[{\mathbb{1}}-{{{\rm{i}}}}\sqrt{\kappa }{(\omega {\mathbb{1}}-H({{{\bf{x}}}},\theta ))}^{-1}\sqrt{\kappa }]{{{{\bf{a}}}}}_{{{{\rm{probe}}}}}(\omega )\\ &\equiv &S(\omega ,{{{\bf{x}}}},\theta ){{{{\bf{a}}}}}_{{{{\rm{probe}}}}}(\omega )\end{array},$$

(10)

which yields the expression for the scattering matrix given in equation (3). Note that this matrix is still well defined in the case when not all the modes are coupled to waveguides since intrinsic losses \({\kappa }_{j}^{{\prime} }\) ensure that *H*(**x**, *θ*) is invertible.
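To make equation (10) concrete, the following sketch (a toy example with arbitrary illustrative detunings and couplings, not parameters from this work) assembles a small dynamical matrix *H* and evaluates *S*(*ω*); in the absence of intrinsic losses, the resulting scattering matrix is unitary, reflecting that no photons are lost:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy system: 3 coupled modes, each attached to a probe waveguide.
N = 3
kappa = np.array([1.0, 0.8, 1.2])     # extrinsic decay rates kappa_j
Delta = np.array([0.0, 0.5, -0.3])    # detunings Delta_j
J = rng.normal(size=(N, N))           # Hermitian mode-mode couplings
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

# Dynamical matrix: H_jj = Delta_j - i*kappa_j/2 (intrinsic losses set to 0),
# off-diagonal entries given by the couplings J.
H = np.diag(Delta - 1j * kappa / 2) + J
sqrt_k = np.diag(np.sqrt(kappa))

def S(omega):
    """Scattering matrix of eq. (10): S = 1 - i sqrt(k) (w 1 - H)^{-1} sqrt(k)."""
    return np.eye(N) - 1j * sqrt_k @ np.linalg.inv(omega * np.eye(N) - H) @ sqrt_k

# Without intrinsic losses, S is unitary: no photons are lost.
S0 = S(0.7)
assert np.allclose(S0.conj().T @ S0, np.eye(N))
```

Adding intrinsic losses (a finite *κ*′ on the diagonal of *H* that does not appear in √κ) breaks this unitarity, since energy then leaks into unmonitored channels.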

### Layered architecture

#### Recursive solution to the scattering problem

The physical mechanism giving rise to information processing in the neuromorphic platform introduced here is, at first glance, fundamentally different from that of standard ANNs, since waves scatter back and forth in the device rather than propagating unidirectionally. Nevertheless, we can choose an architecture that is inspired by the typical layer-wise structure of ANNs (Extended Data Fig. 1a). It allows us to gain analytic insight into the scattering matrix; in particular, the mathematical structure of the scattering matrix reveals a recursive structure reminiscent of the iterative application of maps in a standard ANN (Extended Data Fig. 1b). The analytic formulas also allow us to derive optimal layer sizes that make efficient use of the number of independent parameters available. This correspondence between the mathematical structure of the scattering matrix and standard neural networks only arises in a layered physical structure and is absent in a completely arbitrary scattering setup (which is, however, also covered by our general framework).

We consider a layered architecture (Fig. 2a) with *L* layers and *N*_{ℓ} neuron modes in the *ℓ*th layer. Neuron modes are only coupled to neuron modes in consecutive layers but not within a layer. Note that although we sketch fully connected layers (Fig. 2a), the network does not, in principle, have to be fully connected. However, fully connected layers have the advantage that there is a priori no ordering relation between the modes based on their proximity within a layer.

This architecture allows us to gain an analytic insight into the mathematical structure of the scattering matrix. For a compact notation, we split the vector **a** of neuron modes (equation (1)) into vectors \({{{{\bf{a}}}}}_{n}\equiv ({a}_{1}^{(n)},\ldots ,{a}_{{N}_{n}}^{(n)})\) collecting the neuron modes of the *n*th layer. Correspondingly, we define the detunings of the neuron modes in the *n*th layer as \({\varDelta }^{(n)}={{{\rm{diag}}}}\,({\varDelta }_{1}^{(n)}\ldots {\varDelta }_{{N}_{n}}^{(n)})\), the extrinsic decay rates to the waveguides \({\kappa }^{\;(n)}={{{\rm{diag}}}}\,({\kappa }_{1}^{(n)},\ldots ,{\kappa }_{{N}_{n}}^{(n)})\), the intrinsic decay rates \({\kappa }^{{\prime} (n)}={{{\rm{diag}}}}\,({\kappa }_{1}^{{\prime} (n)},\ldots ,{\kappa }_{{N}_{n}}^{{\prime} (n)})\), the total decay rate \({{\kappa }_{{{{\rm{tot}}}}}}^{(n)}={\kappa }^{{\prime} (n)}+{\kappa }^{\;(n)}\) and *J*^{(n)}, which is the coupling matrix between layer *n* and (*n* + 1) where the element \({J}_{j,\ell }^{\;(n)}\) connects neuron mode *j* in layer *n* to neuron mode *ℓ* in layer (*n* + 1). Note that this accounts for the possibility of attaching waveguides to all the modes in each layer to physically evaluate gradients according to equations (7) and (8). In the frequency domain, we obtain the equations of motion for the *n*th layer as

$$\begin{array}{rcl}-{{{\rm{i}}}}\omega {{{{\bf{a}}}}}_{n}&=&\left(-\frac{{\kappa }_{{{{\rm{tot}}}}}^{(n)}}{2}-{{{\rm{i}}}}{\varDelta }^{(n)}\right){{{{\bf{a}}}}}_{n}-{{{\rm{i}}}}{J}^{(n)}{{{{\bf{a}}}}}_{n+1}-{{{\rm{i}}}}{[\;{J}^{\;(n-1)}]}^{{\dagger} }{{{{\bf{a}}}}}_{n-1}\\ &&-\sqrt{{\kappa }^{\;(n)}}{{{{\bf{a}}}}}_{n,{{{\rm{probe}}}}},\end{array}$$

(11)

where we omitted the frequency argument *ω* for clarity. Equation (11) does not have the typical structure of a feed-forward neural network since neighbouring layers are coupled to the left and right—a consequence of wave propagation through the system.

We are interested in calculating the scattering response at the last layer *L* (the output layer), which defines the output of the neuromorphic system; therefore, we set **a**_{n,probe} = 0 for *n* ≠ *L* in equation (11) and only consider the response to the probe fields **a**_{L,probe}. The following procedure allows us to calculate the scattering matrix block *S*_{out} relating only **a**_{L,probe} to **a**_{L,res}. A suitable set of matrix elements of *S*_{out} then defines the output of the neuromorphic system via equation (4).

Solving for **a**_{1}(*ω*), then **a**_{2}(*ω*) and subsequent layers up to **a**_{L}(*ω*), we obtain a recursive formula for **a**_{n}(*ω*):

$${{{{\bf{a}}}}}_{n}=-{{{\rm{i}}}}{{{{\mathcal{G}}}}}_{n}\;{J}^{\;(n)}{{{{\bf{a}}}}}_{n+1}$$

(12)

with

$${{{{\mathcal{G}}}}}_{n}(\omega )\equiv {\left[\frac{{\kappa }_{{{{\rm{tot}}}}}^{(n)}}{2}+{{{\rm{i}}}}({\varDelta }^{(n)}-\omega )+{{J}^{\;(n-1)}}^{{\dagger} }{{{{\mathcal{G}}}}}_{n-1}{\;J}^{\;(n-1)}\right]}^{-1}$$

(13)

and \({{{{\mathcal{G}}}}}_{0}=0\); therefore, in the last layer, we have

$${{{{\bf{a}}}}}_{L}=-{{{{\mathcal{G}}}}}_{L}(\omega )\sqrt{{\kappa }^{\;(L)}}\,{{{{\bf{a}}}}}_{L,{{{\rm{probe}}}}}.$$

(14)

At the last layer, the matrix \({{{{\mathcal{G}}}}}_{L}(\omega )\) coincides with the system’s Green function of equation (3), \(G(\omega )={{{{\mathcal{G}}}}}_{L}(\omega )\).
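This identification can be checked numerically: building \({{{{\mathcal{G}}}}}_{n}\) layer by layer through equation (13) must reproduce the last-layer block of \({{{\rm{i}}}}{(\omega {\mathbb{1}}-H)}^{-1}\) obtained by inverting the full dynamical matrix. A minimal sketch, with arbitrary illustrative layer sizes, detunings and couplings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical layered network: layer sizes chosen arbitrarily.
sizes = [4, 3, 2]
L = len(sizes)
omega = 0.2
Delta = [rng.normal(size=n) for n in sizes]        # detunings per layer
kappa_tot = [0.4 * np.ones(n) for n in sizes]      # total decay rates per layer
J = [rng.normal(size=(sizes[n], sizes[n + 1])) for n in range(L - 1)]

# Recursion of eq. (13): G_0 = 0 (empty block), then build G_n layer by layer.
G = np.zeros((0, 0))
for n in range(L):
    bracket = np.diag(kappa_tot[n] / 2 + 1j * (Delta[n] - omega))
    if n > 0:
        bracket = bracket + J[n - 1].conj().T @ G @ J[n - 1]
    G = np.linalg.inv(bracket)

# Direct route: assemble the full dynamical matrix H and take the block of
# i (w 1 - H)^{-1} on the last layer.
N = sum(sizes)
off = np.cumsum([0] + sizes)
H = np.zeros((N, N), complex)
for n in range(L):
    s = slice(off[n], off[n + 1])
    H[s, s] = np.diag(Delta[n] - 1j * kappa_tot[n] / 2)
    if n < L - 1:
        s2 = slice(off[n + 1], off[n + 2])
        H[s, s2] = J[n]
        H[s2, s] = J[n].conj().T
G_direct = 1j * np.linalg.inv(omega * np.eye(N) - H)[off[-2]:, off[-2]:]
assert np.allclose(G, G_direct)   # the recursion reproduces the Green function
```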

Employing input–output relations^{30,31} \({{{{\bf{a}}}}}_{L,{{{\rm{res}}}}}={{{{\bf{a}}}}}_{L,{{{\rm{probe}}}}}+\sqrt{{\kappa }^{(L)}}{{{{\bf{a}}}}}_{L}\), we obtain the scattering matrix for the response at the last layer as

$${S}_{{{{\rm{out}}}}}(\omega ,{{{\bf{x}}}},\theta )={\mathbb{1}}-\sqrt{{\kappa }^{(L)}}{\left[\frac{{\kappa }_{{{{\rm{tot}}}}}^{(L)}}{2}+{{{\rm{i}}}}({\varDelta }^{(L)}-\omega )+{{J}^{\;(L-1)}}^{{\dagger} }{{{{\mathcal{G}}}}}_{L-1}{J}^{\;(L-1)}\right]}^{-1}\sqrt{{\kappa }^{\;(L)}}.$$

(15)

The structure of equations (13) and (15) is reminiscent of a generalized continued fraction, with the difference that scalar coefficients are replaced by matrices. We explore this analogy further in Supplementary Information where we also show that for scalar inputs and outputs, the scattering matrix in equation (15) can approximate arbitrary analytic functions. Furthermore, the recursive structure defined by equations (13) and (14) mimics that of a standard ANN in which the weight matrix is replaced by the coupling matrix and the matrix inverse serves as the activation function. However, in contrast to the standard activation function, which is applied element-wise for each neuron, the matrix inversion acts on the entire layer. To gain intuition for the effect of taking the matrix inverse, we plot a diagonal entry of \({[{{{{\mathcal{G}}}}}_{1}(0)]}_{j,\;j}/\kappa ={\left[\frac{1}{2}+{{{\rm{i}}}}{\varDelta }_{j}^{(1)}/{\kappa }_{j}^{(1)}\right]}^{-1}\) (Fig. 2a). The real part follows a Lorentzian, whereas the imaginary part is reminiscent of a tapered sigmoid function.
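The shape of this effective activation is easy to verify numerically; sweeping the detuning confirms the Lorentzian real part (peak value 2 on resonance) and the odd, sigmoid-like imaginary part:

```python
import numpy as np

# Diagonal entry of the first-layer Green function at omega = 0:
# [G_1(0)]_jj / kappa = 1 / (1/2 + i*Delta/kappa).
delta = np.linspace(-5, 5, 1001)      # detuning in units of kappa
g = 1.0 / (0.5 + 1j * delta)

# Real part: Lorentzian with peak value 2 on resonance (delta = 0).
assert np.isclose(g.real.max(), 2.0)
# Imaginary part: odd, saturating curve reminiscent of a tapered sigmoid.
assert np.allclose(g.imag, -g.imag[::-1], atol=1e-10)
```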

Our approach bears some resemblance to the variational quantum circuits of the quantum machine learning literature^{51,52} in which the successive application of discrete unitary operators allows the realization of nonlinear operations. In contrast, here we consider the steady-state scattering response, which allows waves to propagate back and forth, giving rise to yet more complicated nonlinear maps (equation (3)).

### Further details on the test case of digit recognition

#### Training the neuromorphic system

The dataset consists of 3,823 training images (8 × 8 pixels) of handwritten numerals between 0 and 9 as well as 1,797 test images. We choose an architecture in which the input is encoded in the detunings \({\varDelta }_{j}^{(1)}\) of the first layer to which we add a trainable offset \({\varDelta }_{j}^{(0)}\):

$$\begin{array}{rcl}{\varDelta }_{2j}^{(1)}&=&{\varDelta }_{2j}^{(0)}+{x}_{j}\\ {\varDelta }_{2j+1}^{(1)}&=&-{\varDelta }_{2j+1}^{(0)}+{x}_{j}\end{array},$$

(16)

which we initially set to \({\varDelta }_{2j}^{(0)}={\varDelta }_{2j+1}^{(0)}=4\kappa\) (for simplicity, we consider equal decay rates \({\kappa }_{j}^{(n)}+{\kappa }_{j}^{{(n)}^{{\prime} }}\equiv \kappa\) from now on and express all the other parameters in terms of *κ*). In this way, the input enters the second layer in the form of \({[{{{{\mathcal{G}}}}}_{1}(0)]}_{j,\;j}/\kappa\) once with a positive sign and once with a negative sign (the plot of \({[{{{{\mathcal{G}}}}}_{1}(0)]}_{j,\;j}/\kappa\) is shown in Fig. 2b). This doubling of the input fixes the first-layer size to *N*_{1} = 2 × 64. The second layer (hidden layer) can be of variable size, and we train systems (1) without a hidden layer, (2) with *N*_{2} = 20, (3) with *N*_{2} = 30, (4) with *N*_{2} = 60 and (5) with *N*_{2} = 80. As we discuss in the next section and show in the Supplementary Information, *N*_{2} = 70 is the optimal hidden-layer size for this input size and the given value of *R*: at *N*_{2} < 70, the neuromorphic system does not have enough independent parameters, whereas at *N*_{2} > 70, there are more parameters than strictly necessary, which may nevertheless help with the training. We use one-hot encoding of the classes, so the output layer is fixed at *N*_{3} = 10. For simplicity, we set the intrinsic decay to zero (*κ*′ = 0) at the last layer, where we probe the system.

We initialize the system by making the neuron modes in the second and third layers resonant and add a small amount of disorder, that is, \({\varDelta }_{j}^{(n)}/\kappa ={w}_{\varDelta }({\xi }_{j}^{\;(n)}-1/2)\) with w_{Δ} = 0.002 and *ξ* ∈ [0, 1). Similarly, we set the coupling rates to 2*κ* and add a small amount of disorder, that is, \({J}_{j,\ell }^{\;(n)}/\kappa =2+{w}_{J}{\xi }_{j,\ell }^{\;(n)}\) with *w*_{J} = 0.2. We empirically found that this initialization leads to the fastest convergence of the training. Furthermore, we scale our input images such that the background is off-resonant at *Δ*/*κ* = 5 and the numerals are resonant at *Δ*/*κ* = 0 (Fig. 2a) since, otherwise, the initial gradients are very small.

As the output of the system, we consider the imaginary part of the diagonal entries of the scattering matrix (equation (15)) at the last layer at *ω* = 0, that is, *y*_{ℓ} ≡ Im*S*_{ℓ,ℓ}(*ω* = 0, **x**, *θ*) (Fig. 2a). The goal is to minimize the categorical cross-entropy cost function \({{{\mathcal{C}}}}=-{\sum }_{j}{\;y}_{j}^{{{{\rm{tar}}}}}\log \left(\frac{{e}^{\;\beta {y}_{j}}}{{\sum }_{\ell }{e}^{\beta {y}_{\ell }}}\right)\) with *β* = 8 in which \({y}_{j}^{{{{\rm{tar}}}}}\) is the *j*th component of the target output of the system, which is 1 at the index of the correct class and 0 elsewhere. We train the system by performing stochastic gradient descent, that is, for one mini-batch, we select 200 random images, and compute the gradients of the cost function according to which we adjust the system parameters: the detunings \({\varDelta }_{j}^{(n)}\) and the coupling rates \({J}_{j,\ell }^{\;(n)}\).

We show the accuracy evaluated on the test set at different stages during the training for five different architectures (Fig. 2b). Both the convergence speed and the maximally attainable test accuracy depend on the size of the hidden layer, with the system without a hidden layer performing the worst. This should not surprise us, since an increase in *N*_{2} also increases the number of trainable parameters. Interestingly, we observed that systems with smaller *N*_{2} have a higher tendency to get stuck. In some cases, however, training these systems for a very long time can still lead to convergence at a higher accuracy. The confusion matrix of the trained system with *N*_{2} = 80 after 2,922 epochs is shown in Fig. 2c.

We show the evolution of the output scattering matrix elements evaluated for a few specific images (Fig. 2d). For this training run, we used Adam optimization. During training, the scattering matrix quickly converges to the correct classification result. Only for images that look very similar (for example, the images of the numerals 4 and 9 (Fig. 2d)) does the training oscillate between the two classes and take longer to converge.

### Comparison to conventional classifiers

Comparing neuromorphic architectures against standard ANNs in terms of achievable test accuracy and other measures can be useful, provided that the implications of such a comparison are properly understood. In performing this comparison, we have to keep in mind the main motivations for switching to neuromorphic architectures: exploiting the energy savings afforded by analogue processing based on the basic physical dynamics of the device, getting rid of the inefficiencies stemming from the von Neumann bottleneck (separation of memory and processor) and potential speedup by a high degree of parallelism. Therefore, the use of neuromorphic platforms can be justified even when, for example, a conventional ANN with the same parameter count is able to show higher accuracy. The usefulness of such a comparison rather lies in providing a sense of scale: we want to make sure that the neuromorphic platform can reach good training results in typical tasks, not too far off those of ANNs, with a reasonable amount of physical resources. Furthermore, the classification results attained by either ANNs or neuromorphic systems depend on hyperparameters and initialization. Although ANNs have been optimized over the years, it stands to reason that strategies for achieving higher classification accuracies can also be found for neuromorphic systems (such as our approach) in the future.

We compared our results against a linear classifier (an ANN with linear activations) and an ANN with an input layer of 64 neurons, a hidden layer of size 150 (to obtain the same number of parameters as in the neuromorphic system) and an output layer of size 10. For the ANN, we used sigmoid activation functions and a softmax function at the last layer. For the linear classifier, we used a mean-squared-error cost function (to benchmark the performance of a purely linear model); for the ANN, we used a categorical cross-entropy cost function. The linear classifier attained a test accuracy of 92.30%, which is clearly surpassed by our neuromorphic system, whereas the ANN achieved a test accuracy of 97.05%, on par with the accuracy attained by our neuromorphic system (97.10%). To allow for a fair comparison of the convergence speed (Fig. 2b), we trained all the networks with stochastic gradient descent.

We also compare these results with a CNN. The low image resolution of the dataset prevents the use of standard CNNs (such as LeNet-5). Since the small image size only allows for one convolution step, we trained the following CNN: the input layer is followed by one convolutional layer with six channels and a pooling layer, followed by densely connected layers of sizes 120, 84 and 10. We used two different kernels for the convolution: (1) 3 × 3 with stride 2 and (2) 5 × 5 with stride 2; in the former, we obtained a test accuracy of 96.4%, whereas in the latter, the test accuracy amounted to 96.7%. Figure 2b shows the accuracy during training. Although the maximal accuracy of the CNN lies below that of the fully connected neural network, it converges faster. We attribute the slightly worse performance of the CNN to the low resolution of the dataset.

The accuracies obtained by the ANNs and CNNs lie below the record accuracy achieved by a CNN on MNIST. However, this should not surprise us since (1) the dataset we study^{33} is entirely different from the MNIST dataset and (2) its resolution is lower than that of the MNIST dataset (8 × 8 instead of 28 × 28 pixels).

### Case study: Fashion-MNIST

To study the performance of our approach on a more complex dataset, we performed image classification on the Fashion-MNIST dataset^{53}. This dataset consists of 60,000 images of 28 × 28 pixels of ten different types of garment in the training set and 10,000 images in the test set. Fashion-MNIST is considered to be more challenging than MNIST; in fact, most typical classifiers perform considerably worse on Fashion-MNIST than on MNIST.

We consider a transmission setup in which the probe light is injected at the first layer, illuminating all the modes equally, and the response is measured at the other end of the system (Extended Data Fig. 2a), in contrast to the study in the main text, for which both probe and response were located at the last layer. We chose this setup since we (1) wanted to test an architecture even closer to standard layered neural networks and (2) found it to somewhat increase the performance for this dataset. Following a derivation analogous to that of equation (15), we obtain the following expression for the scattering matrix:

$${S}_{{{{\rm{out}}}}}(\omega ,{{{\bf{x}}}},\theta )=-\mathrm{i}\sqrt{{\kappa }^{\;(L)}}\left\{\displaystyle\prod_{n = 2}^{L}{\tilde{\chi }}_{n}{[\;{J}^{\;(n-1)}]}^{{\dagger} }\right\}{\tilde{\chi }}_{1}\sqrt{{\kappa }^{(1)}}{\xi }_{{{{\rm{probe}}}}},$$

(17)

in which, as above, the coupling matrix acting on layer *n* − 1 is [*J*^{(n–1)}]^{†} and \({\xi }_{j,{{{\rm{probe}}}}}=1/\sqrt{{N}_{1}}\). The recursively defined effective susceptibilities are given by (*n* = *L* − 1…1)

$${\tilde{\chi }}_{n}={\left({\chi }_{n}^{\;-1}-{J}^{\;(n)}{\tilde{\chi }}_{n+1}{[\;{J}^{\;(n)}]}^{{\dagger} }\right)}^{-1},$$

(18)

in which *χ*_{n} = (*ω* – *Δ*^{(n)} + i*κ*^{(n)}/2)^{–1} and \({\tilde{\chi }}_{L}={\chi }_{L}\). We note that *κ*^{(1)} is the diagonal matrix containing the (uniform) decay rates that describe the coupling of an external waveguide to all the neuron modes of layer 1, which, thus, experience a homogeneous probe drive. Furthermore, inputs *x* are encoded into layer 1 by adding them to the frequencies *Δ*^{(1)}. In the experiments studied below, we did not need any input replication to obtain good results.
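As a consistency check, the recursion in equations (17) and (18) can be compared against a direct solution of the coupled-mode equations. In the sketch below, *κ*^{(n)} in *χ*_{n} is interpreted as the total decay rate of layer *n*, and all layer sizes, rates and couplings are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small transmission setup: probe waveguides on layer 1, output on layer L.
sizes = [4, 3, 2]
L = len(sizes)
omega = 0.3
Delta = [rng.normal(size=n) for n in sizes]
kappa = [np.zeros(n) for n in sizes]            # extrinsic rates (waveguides)
kappa[0][:] = 1.0
kappa[-1][:] = 1.0
kappa_int = [0.1 * np.ones(n) for n in sizes]   # intrinsic losses
kappa_tot = [k + ki for k, ki in zip(kappa, kappa_int)]
J = [rng.normal(size=(sizes[n], sizes[n + 1])) for n in range(L - 1)]

# Effective susceptibilities of eq. (18), built backwards from chi~_L = chi_L.
def chi(n):
    return np.diag(1.0 / (omega - Delta[n] + 1j * kappa_tot[n] / 2))

chi_t = [None] * L
chi_t[-1] = chi(L - 1)
for n in range(L - 2, -1, -1):
    chi_t[n] = np.linalg.inv(
        np.linalg.inv(chi(n)) - J[n] @ chi_t[n + 1] @ J[n].conj().T
    )

# Response at the last layer, eq. (17), for a uniform probe xi.
xi = np.ones(sizes[0]) / np.sqrt(sizes[0])
vec = chi_t[0] @ (np.sqrt(kappa[0]) * xi)
for n in range(1, L):
    vec = chi_t[n] @ (J[n - 1].conj().T @ vec)
S_out = -1j * np.sqrt(kappa[-1]) * vec

# Cross-check: solve the full coupled-mode equations directly,
# a = -i (w 1 - H)^{-1} sqrt(k) a_probe, then a_res = sqrt(k^(L)) a_L.
N = sum(sizes)
off = np.cumsum([0] + sizes)
H = np.zeros((N, N), complex)
for n in range(L):
    s = slice(off[n], off[n + 1])
    H[s, s] = np.diag(Delta[n] - 1j * kappa_tot[n] / 2)
    if n < L - 1:
        s2 = slice(off[n + 1], off[n + 2])
        H[s, s2] = J[n]
        H[s2, s] = J[n].conj().T
a_probe = np.zeros(N, complex)
a_probe[: sizes[0]] = xi
a = np.linalg.solve(
    omega * np.eye(N) - H, -1j * np.sqrt(np.concatenate(kappa)) * a_probe
)
S_direct = np.sqrt(kappa[-1]) * a[off[-2]:]
assert np.allclose(S_out, S_direct)   # recursion matches the direct solution
```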

Specifically, we study two different architectures of our neuromorphic system: (1) a layer architecture of three fully connected layers; (2) an architecture inspired by a CNN with the aim of reducing the parameter count and directly comparing against convolutional ANNs (Supplementary Section VI). In both cases, we trained the networks on both full resolution images and downsized versions of 14 × 14 pixels obtained by averaging over 2 × 2 patches.

We show the test accuracy during training (Extended Data Fig. 2b) for two different learning rates (10^{−3} and 10^{−4}) using AMSGrad as the optimizer^{54} (performing slightly better than Adam) and compare them with ANNs, CNNs and a linear neural network with (approximately) the same number of parameters and the same training procedure. A comparison of the best attained accuracies as well as a list of the number of parameters and neuron modes or ANN neurons is provided in Supplementary Table I. We obtained the best test accuracy of 89.8% with the fully connected layer network on both small and large images with a learning rate of 10^{−3} as well as on small images with a learning rate of 10^{−4}. This clearly exceeds the performance of the linear neural network, which only achieves a peak test accuracy of 86.5%. The comparable digital ANN achieved 89.7% on the small images at a learning rate of 10^{−3}, 90.5% on the large images at a learning rate of 10^{−3} and 90.6% on the small images at a learning rate of 10^{−4}. For the sake of a more systematic comparison, we used a constant learning rate for the results reported here. However, we found that introducing features like learning-rate scheduling can lead to slightly better results with our setup (for example, we reached up to 90.1% accuracy for the fully connected layer neuromorphic setup trained on small images in only 100 epochs using a linearly cycled learning rate peaking at 0.1).

Our best convolutional network achieved 88.2% on the large image data with learning rates of 10^{−3} and 10^{−4}, whereas the comparable CNNs attained 90.5% and 91.2%, respectively. These architectures contain only a fraction of the number of parameters of the fully connected architecture; therefore, they may be preferable for implementations. Increasing the number of parameters of CNNs close to that of the fully connected ANNs (by increasing the number of channels) can increase the accuracy to 93% (ref. ^{55}), but comes at the cost of a large number of parameters. Furthermore, more advanced CNNs with features such as dropout as well as the use of data augmentation or other preprocessing techniques can achieve a test accuracy above 93% (ref. ^{55}), but for the sake of comparability, we chose to train ANNs and CNNs with a similar architecture as our neuromorphic system. For comparison, a detailed list of benchmarks with various digital classifiers without preprocessing is provided in another work^{53}, with logistic regression achieving 84.2% and support vector classifiers attaining 89.7%.

### Remark on Adam optimization

Using adaptive gradient descent techniques, such as Adam optimization, can help increase the convergence speed. We successfully used adaptive optimizers (such as Adam and AMSGrad) for training optimization on the Fashion-MNIST dataset. However, during the training on the digit recognition dataset (with small images and thus small neuron counts), we observed that Adam optimization causes the cost function and accuracy to fluctuate strongly and the training can, in some cases, get stuck. We attribute this to the fact that in the training on the digit dataset, the gradients become very small very quickly (on the order of 10^{−6}), which is known to make Adam optimization unstable. Stochastic gradient descent does not suffer from the same problem and was, therefore, better suited for training on the digit dataset. The best accuracy we achieved when using Adam optimization on the digit dataset was 96.0%, which is below the 97.1% that we achieved with stochastic gradient descent.

### Hyperparameters

The architecture we propose has the following hyperparameters, which influence the accuracy and convergence speed: (1) the number of neuron modes per layer *N*_{ℓ}; (2) the number of layers *L*; (3) the number of input replications *R*; (4) the intrinsic decay rate of the system.

(1) The number of neuron modes in the first layer should match the product of the input dimension *D* and the number of replications of the input *R*. The number of neurons in the final layer is given by the output dimension, for example, for classification tasks, the dimension corresponds to the number of classes. The sizes of intermediate layers should be chosen to provide enough training parameters; as shown in Fig. 2b, both convergence speed and best accuracy rely on having a sufficient number of training parameters available, as, for instance, a system with a second layer of only 20 neuron modes performs worse than a system with 60 neuron modes in the second layer. In particular, as we show in the Supplementary Information, independent of the architecture, the maximal number of independent couplings for an input of dimension *D*, input replication *R* and output dimension *N*_{out} is given by

$${N}_{{{{\rm{opt}}}}}=\frac{(RD+{N}_{{{{\rm{out}}}}})(RD+{N}_{{{{\rm{out}}}}}+1)}{2}.$$

(19)

In addition, there are *R**D* + *N*_{out} local detunings that can be independently varied. For the layered structure (Fig. 2a), this translates to an optimal size of the second layer of *N*_{2,opt} = ⌈(*N*_{1} + *N*_{3} + 1)/2⌉. For a system with *N*_{1} = 128 and *N*_{3} = 10, the optimal value is *N*_{2,opt} = 70. Smaller *N*_{2} leads to fewer independent parameters, whereas larger *N*_{2} introduces redundant parameters. In our simulations, the system with *N*_{2} = 60 already achieved a high classification accuracy; therefore, it may not always be necessary to increase the layer size to *N*_{2,opt}. Increasing the number of parameters further, although the additional parameters are redundant, can help avoid getting stuck during training. Indeed, we obtained the best classification accuracy for a system with *N*_{2} = 80. Similarly, even though a neuromorphic system with all-to-all couplings provides a sufficient number of independent parameters, considering a system with multiple hidden layers can be advantageous for training.
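The parameter counting of equation (19) and the resulting optimal hidden-layer size can be verified in a few lines (using the digit-recognition values *R* = 2, *D* = 64 and *N*_{out} = 10 from the text):

```python
import math

# Maximal number of independent couplings, eq. (19),
# and the optimal hidden-layer size N_2,opt = ceil((N_1 + N_3 + 1) / 2).
def n_opt(R, D, n_out):
    m = R * D + n_out
    return m * (m + 1) // 2

def n2_opt(n1, n3):
    return math.ceil((n1 + n3 + 1) / 2)

# Digit-recognition setup: R = 2, D = 64, N_out = 10.
assert n_opt(2, 64, 10) == 9591      # (138 * 139) / 2 independent couplings
assert n2_opt(128, 10) == 70         # matches N_2,opt = 70 quoted in the text
```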

(2) Similar considerations apply to choosing the number of layers of the network. *L* should be large enough to provide a sufficient number of training parameters. At the same time, *L* should not be too large, in order to minimize attenuation losses. For deep networks with *L* ≳ 3, localization effects may become important^{56,57}, which may hinder the training; it is therefore advisable to choose an architecture with sufficiently small *L*.

(3) The choice of *R* depends on the complexity of the training set. For the digit recognition task, *R* = 2 was sufficient, but more complicated datasets can require larger *R*. We explore this question in the Supplementary Information, where we show for scalar functions that *R* determines the approximation order and utilize our system to fit scalar functions. To fit quickly oscillating functions or functions with other sharp features, we require larger *R*.

Here we only considered encoding the input in the first layer. However, it could be interesting to explore, in the future, whether spreading the (replicated) input over different layers holds an advantage, since this would allow subsequent layers to be made more ‘nonlinear’.

In general, the input replication *R* can be thought of as the approximation order of the nonlinear output function. Specifically, an input replication *R* implies that the scattering matrix is not only a function of *x*_{j} but also \({x}_{j}^{2}\)…\({x}_{j}^{R}\). This statement also holds in other possible neuromorphic systems based on linear physics, for example, in a sequential scattering setup.

(4) The intrinsic decay rate determines the sharpest feature that can be resolved; equivalently, a larger rate *κ* smooths the output function. It is straightforward to see from equation (7) that the derivative of the output with respect to a component of the input scales with *κ*. In the context of one-dimensional function fitting, this is straightforward to picture, and we provide some examples in the Supplementary Information. However, a larger decay rate can be compensated for by rescaling the range of the input according to *κ*. For instance, for the digit recognition training, we set the image background to *Δ*/*κ* = 5 and made the pixels storing the numeral resonant at *Δ*/*κ* = 0, which is still possible in lossy systems.