Fishery Modernization ›› 2025, Vol. 52 ›› Issue (6): 115-122. doi: 10.26958/j.cnki.1007-9580.2025.06.014

Fish feeding behavior recognition based on sonar images and dual-stream spatio-temporal attention

WANG Zhijun1,2, ZHAO Xia1

  1. School of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
  2. Fishery Machinery and Instrument Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200092, China
  • Online:2025-12-20 Published:2025-12-26

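  • Corresponding author: ZHAO Xia (赵霞, born 1974), female, Ph.D., associate professor; research interests: control algorithms, deep learning. E-mail: zhaoxia@tongji.edu.cn
  • About the author: WANG Zhijun (王志俊, born 1990), male, master's student; research interests: deep learning, signal processing. E-mail: wang_zhijun@tongji.edu.cn

  • Funding:
    National Key Research and Development Program of China (2023YFD2401304)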

Abstract: Fish feeding behavior recognition from sonar images suffers from heavy noise interference and weak feature representation under small-sample conditions. To address these problems, this paper proposes a dual-stream spatio-temporal attention network that fuses domain knowledge with deep features. First, an improved wavelet filtering algorithm removes bubble noise from the sonar images. Next, a dual-stream feature fusion architecture is designed: the statistical feature stream consists of six features, including target count and the standard deviation of target spacing, while the deep feature stream extracts high-level semantic features from the sonar images with a residual network (ResNet18). A long short-term memory (LSTM) network captures the temporal dependencies of behavior sequences, and a spatio-temporal cross-attention mechanism adaptively focuses on key frames and target regions. Experiments on a self-built dataset show that the network reaches a classification accuracy of 77.0%, with wavelet denoising, dual-stream fusion, and the spatio-temporal attention mechanism improving accuracy by 1.8%, 5.9%, and 2.8%, respectively, which verifies the effectiveness of each component. This study provides a new method for underwater target behavior recognition based on imaging sonar.


Key words: sonar image, wavelet denoising, feature fusion, LSTM, spatio-temporal cross-attention
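
As a rough illustration of the statistical feature stream mentioned in the abstract, the sketch below computes the two features the abstract names, target count and spacing standard deviation, for a single frame. It is not the authors' implementation: the abstract does not define "spacing", so nearest-neighbor distance between target centroids is assumed here, and the function names and sample coordinates are hypothetical. The remaining four features are not listed in the abstract and are not guessed at.

```python
import math
from statistics import pstdev


def nearest_neighbor_distances(points):
    """Distance from each target centroid to its closest other target.

    Assumption: "spacing" is taken to mean nearest-neighbor distance;
    the abstract does not give the exact definition.
    """
    dists = []
    for i, (xi, yi) in enumerate(points):
        others = [
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        ]
        if others:  # a lone target has no neighbor
            dists.append(min(others))
    return dists


def frame_statistics(points):
    """Return (target_count, spacing_std) for one frame of centroids."""
    count = len(points)
    spacing = nearest_neighbor_distances(points)
    # Fewer than two spacing values carry no spread; report 0.0.
    spacing_std = pstdev(spacing) if len(spacing) >= 2 else 0.0
    return count, spacing_std


# Hypothetical centroids (arbitrary units) for two frames.
scattered = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (10.0, 10.0)]
grouped = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

scattered_count, scattered_std = frame_statistics(scattered)
grouped_count, grouped_std = frame_statistics(grouped)
```

Both frames contain four targets, but the evenly spaced frame has zero spacing spread while the scattered one does not. Hand-crafted values of this kind would form the low-dimensional stream that is fused with the deep image features.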
