Fish feeding behavior recognition from sonar images suffers from strong noise interference and from limited representation capability under small-sample conditions. To address both problems, this paper proposes a dual-stream spatio-temporal attention network that fuses domain knowledge with deep features. First, an improved wavelet filtering algorithm removes bubble noise from the sonar images. Next, a dual-stream feature fusion architecture is designed: the statistical feature stream encodes a six-dimensional feature vector, including target count and the standard deviation of target spacing, while the deep feature stream extracts high-order semantic features from the sonar images with a residual network (ResNet18). A Long Short-Term Memory (LSTM) network captures the temporal dependencies of behavior sequences, and a spatio-temporal cross-attention mechanism adaptively focuses on key frames and target regions. Experiments on a self-built dataset show that the network reaches a classification accuracy of 77.0%. Wavelet denoising, dual-stream fusion, and the spatio-temporal attention mechanism improve accuracy by 1.8%, 5.9%, and 2.8%, respectively, which verifies the effectiveness of each component. This study provides a new method for underwater target behavior recognition.
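As an illustration of the statistical feature stream, the following minimal sketch computes the two features named above, target count and spacing standard deviation, for a single frame. It rests on assumptions that the abstract does not state: targets are given as hypothetical (x, y) centroids from some upstream detection step, spacing is taken as the pairwise Euclidean distance between centroids, and the population standard deviation is used. The remaining four features are not specified here and are therefore omitted.

```python
import math
from itertools import combinations
from statistics import pstdev


def statistical_features(centroids):
    """Return (target_count, spacing_std) for one sonar frame.

    `centroids` is a hypothetical list of (x, y) target positions;
    how targets are detected is outside the scope of this sketch.
    """
    count = len(centroids)
    # Assumption: "spacing" means the Euclidean distance between each pair of targets.
    distances = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    # With fewer than two pairwise distances there is no spread to measure.
    spacing_std = pstdev(distances) if len(distances) >= 2 else 0.0
    return count, spacing_std


# Illustrative frame with three collinear targets (made-up coordinates).
frame = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
count, spacing_std = statistical_features(frame)
```

In a pipeline of the kind described, such per-frame values would be stacked with the other statistical features and passed alongside the ResNet18 features to the temporal model; the exact fusion scheme is not given in this abstract.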