Perception of multimodal objects in NLP through computer vision


  • Sakib Hosen Himel Department of Computer Science and Engineering, Daffodil International University, Dhaka-1207, Bangladesh
  • Mahidul Islam Rana Department of Computer Science and Engineering, Daffodil International University, Dhaka-1207, Bangladesh



MobileNet, SSD-V3, Object detection, NLP, Computer vision, COCO dataset


This project combines voice interaction with object detection. Users speak to the system, and it replies with a synthesized voice. A spoken command acts as the trigger for identifying the category of an object shown to the camera module. The user first holds an object in front of the camera and asks the system to identify it. The object detection system then captures a frame from the camera and passes it through the network, which extracts features and predicts the class the object belongs to. The application searches its database for the category whose stored structural data best matches the extracted features. When the features approximately match the information for a category, the application suggests that category by speaking its name. On request, the application can also give some basic information about the object. Our general-purpose approach can be effective for interpreting the structure and properties of objects across different networks through natural language processing.
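The final step described in the abstract, turning detector output into a spoken category, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Detection`, `pick_category`, and `build_reply`, the 0.5 confidence threshold, and the reply wording are all hypothetical, and the detector and text-to-speech components are left out.

```python
from typing import List, Optional, Tuple

# Hypothetical detector output: (class label, confidence score in [0, 1]).
Detection = Tuple[str, float]


def pick_category(detections: List[Detection], min_score: float = 0.5) -> Optional[str]:
    """Return the label of the most confident detection at or above min_score.

    Returns None when no detection reaches the (assumed) threshold.
    """
    confident = [d for d in detections if d[1] >= min_score]
    if not confident:
        return None
    return max(confident, key=lambda d: d[1])[0]


def build_reply(detections: List[Detection], min_score: float = 0.5) -> str:
    """Build the sentence that a text-to-speech engine would read aloud."""
    label = pick_category(detections, min_score)
    if label is None:
        return "I could not identify the object."
    return f"This looks like a {label}."
```

In a full system, the list passed to `build_reply` would come from the object detector run on the captured frame, and the returned string would be handed to a speech synthesizer.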







How to Cite

Himel, S. H., & Rana, M. I. (2023). Perception of multimodal objects in NLP through computer vision. Recent Research in Science and Technology, 15, 1–7.