Publications
2025
- SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA). Trong-Thuan Nguyen, Viet-Tham Huynh, Quang-Thuc Nguyen, Hoang-Phuc Nguyen, Long Le Bao, and 28 more authors. Computers & Graphics (Special Section on 3DOR 2025), 2025. (Q2, IF = 2.8 in 2024)
Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system’s ability to interpret natural language. Specifically, ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.
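As a rough illustration of the CLIP-based baseline mentioned above, the sketch below scores pre-rendered candidate views against a free-form text query with off-the-shelf CLIP embeddings. It uses the Hugging Face transformers CLIP API; the model name, query, and render_paths are placeholders rather than the challenge pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical inputs: one free-form query and pre-rendered views of candidate 3D models.
query = "a low wooden coffee table with rounded corners"
render_paths = ["candidates/model_001.png", "candidates/model_002.png"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in render_paths]
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
    # Cosine similarity between the single text query and every candidate render.
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)

ranking = scores.argsort(descending=True).tolist()
print([render_paths[i] for i in ranking])
```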
@article{Le2025ROOMELSA, title = {SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)}, author = {Nguyen, Trong-Thuan and Huynh, Viet-Tham and Nguyen, Quang-Thuc and Nguyen, Hoang-Phuc and Bao, Long Le and Minh, Thai Hoang and Anh, Minh Nguyen and Tien, Thang Nguyen and Thuan, Phat Nguyen and Phong, Huy Nguyen and Thai, Bao Huynh and Nguyen, Vinh-Tiep and Nguyen, Duc-Vu and Pham, Phu-Hoa and Le-Hoang, Minh-Huy and Le, Nguyen-Khang and Nguyen, Minh-Chinh and Ho, Minh-Quan and Tran, Ngoc-Long and Le-Hoang, Hien-Long and Tran, Man-Khoi and Tran, Anh-Duong and Nguyen, Kim and Hung, Quan Nguyen and Thanh, Dat Phan and Van, Hoang Tran and Viet, Tien Huynh and Thien, Nhan Nguyen Viet and Vo, Dinh-Khoi and Nguyen, Van-Loc and Le, Trung-Nghia and Nguyen, Tam V. and Tran, Minh-Triet}, journal = {Computers & Graphics (Special Section on 3DOR 2025)}, year = {2025}, note = {(Q2, IF = 2.8 in 2024)}, }
- DYNAFormer: Enhancing transformer segmentation with dynamic anchor mask for medical imaging. Tan-Cong Nguyen, Kim Anh Phung, Thao Thi Phuong Dao, Trong-Hieu Nguyen-Mau, Thuc Nguyen-Quang, and 5 more authors. Computers in Biology and Medicine, 2025. (Q1, IF = 6.3 in 2024)
Polyp shape is critical for diagnosing colorectal polyps and assessing cancer risk, yet there is limited data on segmenting pedunculated and sessile polyps. This paper introduces PolypDB_INS, a dataset of 4403 images containing 4918 annotated polyps, specifically for sessile and pedunculated polyps. In addition, we propose DYNAFormer, a novel transformer-based model utilizing an anchor mask-guided mechanism that incorporates cross-attention, dynamic query updates, and query denoising for improved object segmentation. Treating each positional query as an anchor mask dynamically updated through decoder layers enhances perceptual information regarding the object’s position, allowing for more precise segmentation of complex structures like polyps. Extensive experiments on the PolypDB_INS dataset using standard evaluation metrics for both instance and semantic segmentation show that DYNAFormer significantly outperforms state-of-the-art methods. Ablation studies confirm the effectiveness of the proposed techniques, highlighting the model’s robustness for diagnosing colorectal cancer.
@article{Le2025DYNAFormer, title = {DYNAFormer: Enhancing transformer segmentation with dynamic anchor mask for medical imaging}, author = {Nguyen, Tan-Cong and Phung, Kim Anh and Dao, Thao Thi Phuong and Nguyen-Mau, Trong-Hieu and Nguyen-Quang, Thuc and Pham, Cong Nhan and Le, Trung-Nghia and Shen, Ju and Nguyen, Tam V. and Tran, Minh-Triet}, journal = {Computers in Biology and Medicine}, year = {2025}, note = {(Q1, IF = 6.3 in 2024)}, }
- LookupForensics: A Large-Scale Multi-Task Dataset for Multi-Phase Image-Based Fact Verification. Shuhan Cui, Huy H. Nguyen, Trung-Nghia Le, Chun-Shien Lu, and Isao Echizen. IEEE Access, 2025. (Q1, IF = 3.9 in 2022)
Amid the proliferation of forged images, notably the tsunami of deepfake content, extensive research has been conducted on using artificial intelligence (AI) to identify forged content in the face of continuing advancements in counterfeiting technologies. We have investigated the use of AI to provide the original authentic image after deepfake detection, which we believe is a reliable and persuasive solution. We call this "image-based automated fact verification," a name that originated from a text-based fact-checking system used by journalists. We have developed a two-phase open framework that integrates detection and retrieval components. Additionally, inspired by a dataset proposed by Meta Fundamental AI Research, we further constructed a large-scale dataset that is specifically designed for this task. This dataset simulates real-world conditions and includes both content-preserving and content-aware manipulations that present a range of difficulty levels and have potential for ongoing research. This multi-task dataset is fully annotated, enabling it to be utilized for sub-tasks within the forgery identification and fact retrieval domains. This paper makes two main contributions: (1) We introduce a new task, "image-based automated fact verification," and present a novel two-phase open framework combining "forgery identification" and "fact retrieval." (2) We present a large-scale dataset tailored for this new task that features various hand-crafted image edits and machine learning-driven manipulations, with extensive annotations suitable for various sub-tasks. Extensive experimental results validate its practicality for fact verification research and clarify its difficulty levels for various sub-tasks.
@article{Le2025LookupForensics, title = {LookupForensics: A Large-Scale Multi-Task Dataset for Multi-Phase Image-Based Fact Verification}, author = {Cui, Shuhan and Nguyen, Huy H. and Le, Trung-Nghia and Lu, Chun-Shien and Echizen, Isao}, journal = {IEEE Access}, year = {2025}, note = {(Q1, IF = 3.9 in 2022)}, }
- GUNNEL: Guided Mixup Augmentation and Multi-Model Fusion for Aquatic Animal Segmentation. Minh-Quan Le*, Trung-Nghia Le*, Tam V. Nguyen, Isao Echizen, and Minh-Triet Tran. Neural Computing & Applications, 2025. (Q1, IF = 4.5 in 2023)
Recent years have witnessed great advances in object segmentation research. In addition to generic objects, aquatic animals have attracted research attention. Deep learning-based methods are widely used for aquatic animal segmentation and have achieved promising performance. However, there is a lack of challenging datasets for benchmarking. In this work, we build a new dataset dubbed "Aquatic Animal Species." We also devise a novel GUided mixup augmeNtatioN and multi-modEl fusion for aquatic animaL segmentation (GUNNEL) that leverages the advantages of multiple segmentation models to segment aquatic animals effectively and improves the training performance by synthesizing hard samples. Extensive experiments demonstrated the superiority of our proposed framework over existing state-of-the-art instance segmentation methods.
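The blending step behind mixup-style augmentation can be illustrated with a plain, unguided mixup of two training images; GUNNEL's guided variant additionally uses instance masks to decide where and how strongly to blend, so treat this only as a sketch of the core operation.

```python
import numpy as np

def mixup_pair(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 0.4):
    """Blend two equally sized HxWxC uint8 images with a Beta-sampled coefficient.

    This is generic mixup; the paper's guided variant additionally uses
    instance masks to decide where and how strongly to blend.
    """
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.clip(0, 255).astype(np.uint8), lam

# Usage with random stand-in images of the same shape.
a = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mixed, lam = mixup_pair(a, b)
print(mixed.shape, round(float(lam), 3))
```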
@article{Le2025GUNNEL, title = {GUNNEL: Guided Mixup Augmentation and Multi-Model Fusion for Aquatic Animal Segmentation}, author = {Le, Minh-Quan and Le, Trung-Nghia and Nguyen, Tam V. and Echizen, Isao and Tran, Minh-Triet}, journal = {Neural Computing & Applications}, year = {2025}, note = {(Q1, IF = 4.5 in 2023)}, project_page = {https://github.com/lmquan2000/mask-mixup}, dataset = {https://doi.org/10.5281/zenodo.8208877} }
- Hierarchical Multi-Modal Retrieval for News Image Captioning. Minh-Loi Nguyen*, Xuan-Vu Le*, Long-Bao Nguyen, Hoang-Bach Ngo, and Trung-Nghia Le. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Loi2025HierarchicalMRN, title = {Hierarchical Multi-Modal Retrieval for News Image Captioning}, author = {Nguyen, Minh-Loi and Le, Xuan-Vu and Nguyen, Long-Bao and Ngo, Hoang-Bach and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval. Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, and 2 more authors. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Nguyen2025Vortex, title = {Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval}, author = {Nguyen, Duc-Tho and Tran-Minh, Hieu-Hoc and Lam, Khanh-Hoa and Ly, Hoang-Nhut and Huynh, Huu-Phuc and Tran, Thanh-Tien and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion. Duc-Manh Phan*, Quoc-Duy Tran*, Duy-Khang Do*, Anh-Tuan Vo, Hai-Dang Nguyen, and 7 more authors. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Phan2025ForgedCalamity, title = {Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion}, author = {Phan, Duc-Manh and Tran, Quoc-Duy and Do, Duy-Khang and Vo, Anh-Tuan and Nguyen, Hai-Dang and Do, Trong Le and Tran, Mai-Khiem and Nguyen, Vinh-Tiep and Nguyen, Tam V. and Echizen, Isao and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation. Thi Thu Hien Trinh, and Trung-Nghia Le. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Trinh2025CIAN, title = {CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation}, author = {Trinh, Thi Thu Hien and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- VisionGuard: Synergistic Framework for Helmet Violation Detection. Thanh-Hai Nguyen*, Thinh-Phuc Nguyen*, Gia-Huy Dinh*, Lam-Huy Nguyen*, Minh-Triet Tran, and 1 more author. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Nguyen2025VisionGuard, title = {VisionGuard: Synergistic Framework for Helmet Violation Detection}, author = {Nguyen, Thanh-Hai and Nguyen, Thinh-Phuc and Dinh, Gia-Huy and Nguyen, Lam-Huy and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting. Duy-Dat Tran, and Trung-Nghia Le. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Tran2025Edit3DGS, title = {Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting}, author = {Tran, Duy-Dat and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art. Quoc-Duy Tran, Anh-Tuan Vo, Minh-Triet Tran, and Trung-Nghia Le. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Tran2025VisualRAG, title = {Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art}, author = {Tran, Quoc-Duy and Vo, Anh-Tuan and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval. Nguyen Hoang Cao*, Hoang Bui Le*, Nam Vo Hoang*, and Trung-Nghia Le. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Cao2025FashionRetrieval, title = {Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval}, author = {Cao, Nguyen Hoang and Le, Hoang Bui and Hoang, Nam Vo and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- DTD-Mamba: Dual Teacher Distillation for Mamba in Head and Neck Abscess Segmentation. Thao Thi Phuong Dao, Tan-Cong Nguyen, Trong-Le Do, Mai-Khiem Tran, Minh-Khoi Pham, and 3 more authors. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Dao2025DTDmamba, title = {DTD-Mamba: Dual Teacher Distillation for Mamba in Head and Neck Abscess Segmentation}, author = {Dao, Thao Thi Phuong and Nguyen, Tan-Cong and Do, Trong-Le and Tran, Mai-Khiem and Pham, Minh-Khoi and Le, Trung-Nghia and Tran, Minh-Triet and Le, Thanh Dinh}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration. Thao Thi Phuong Dao, Tan-Cong Nguyen, Nguyen Chi Thanh, Truong Hoang Viet, Trong-Le Do, and 5 more authors. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Dao2025MasHeNe, title = {MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration}, author = {Dao, Thao Thi Phuong and Nguyen, Tan-Cong and Thanh, Nguyen Chi and Viet, Truong Hoang and Do, Trong-Le and Tran, Mai-Khiem and Pham, Minh-Khoi and Le, Trung-Nghia and Tran, Minh-Triet and Le, Thanh Dinh}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- AEye: Avian Monitoring from Streaming Videos. Kasturi Jamale*, Kunal Agrawal*, Ba-Thinh Tran-Le, Jayanth Merakanapalli, Soham Chousalkar, and 3 more authors. In International Symposium on Information and Communication Technology (SoICT), 2025. (Oral)
@inproceedings{Jamale2025AEye, title = {AEye: Avian Monitoring from Streaming Videos}, author = {Jamale, Kasturi and Agrawal, Kunal and Tran-Le, Ba-Thinh and Merakanapalli, Jayanth and Chousalkar, Soham and Patel, Vatsa and Le, Trung-Nghia and Nguyen, Tam V.}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025}, note = {(Oral)} }
- Research Paper Quality Recognition Through Textual Feature Analysis. Saikiran Korla*, Sadwik Gummadavelli*, Trung-Nghia Le, Minh-Triet Tran, and Tam V. Nguyen. In International Symposium on Information and Communication Technology (SoICT), 2025.
@inproceedings{Korla2025PaperQuality, title = {Research Paper Quality Recognition Through Textual Feature Analysis}, author = {Korla, Saikiran and Gummadavelli, Sadwik and Le, Trung-Nghia and Tran, Minh-Triet and Nguyen, Tam V.}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2025} }
- MultiPointing: Supporting Multiple Users’ Pointing in Hybrid Meetings. Dinh-Thuan Duong-Le, Duy-Nam Ly, Trung-Nghia Le, Vinh-Tiep Nguyen, and Khanh-Duy Le. In Australian Conference on Human-Computer Interaction (OzCHI), 2025. (B Rank) (Late Breaking Work)
@inproceedings{DuongLe2025MultiPointing, title = {MultiPointing: Supporting Multiple Users' Pointing in Hybrid Meetings}, author = {Duong-Le, Dinh-Thuan and Ly, Duy-Nam and Le, Trung-Nghia and Nguyen, Vinh-Tiep and Le, Khanh-Duy}, booktitle = {Australian Conference on Human-Computer Interaction (OzCHI)}, year = {2025}, note = {(B Rank) (Late Breaking Work)} }
- OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding. Hieu Nguyen, Phuc-Tan Nguyen, Thien-Phuc Tran, Minh-Quang Nguyen, Tam V. Nguyen, and 2 more authors. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Dataset)
We introduce OpenEvents V1, a large-scale benchmark dataset designed to advance event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that focus on surface-level descriptions, the OpenEvents V1 dataset emphasizes contextual and temporal grounding through three primary tasks: (1) generating rich, event-aware image captions, (2) retrieving event-relevant news articles from image queries, and (3) retrieving event-relevant images from narrative-style textual queries. The dataset comprises over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for all tasks. OpenEvents V1 establishes a robust foundation for developing multimodal AI systems capable of deep reasoning over complex real-world events.
@inproceedings{Nguyen2025OpenEvents, title = {OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding}, author = {Nguyen, Hieu and Nguyen, Phuc-Tan and Tran, Thien-Phuc and Nguyen, Minh-Quang and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Dataset)}, }
- Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025. Thien-Phuc Tran*, Minh-Quang Nguyen*, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, and 5 more authors. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Challenge)
The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility.
@inproceedings{Tran2025EventEnrichedChallenge, title = {Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025}, author = {Tran, Thien-Phuc and Nguyen, Minh-Quang and Tran, Minh-Triet and Nguyen, Tam V. and Do, Trong-Le and Ly, Duy-Nam and Huynh, Viet-Tham and Le, Khanh-Duy and Tran, Mai-Khiem and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Challenge)}, }
- Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification. Y Hop Nguyen, Doan Anh Phan Huu, Trung Thai Tran, Nhat Nam Mai, Van Toi Giap, and 2 more authors. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Challenge)
We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically-relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM’25 ENTRep Grand Challenge, achieving 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
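The joint training objective described above, supervised classification combined with contrastive alignment between image features and class-specific prompts, can be sketched as follows. The shapes and the linear classification head are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointHead(nn.Module):
    """Supervised classification plus CLIP-style image-to-prompt contrastive alignment.

    Hypothetical shapes: image features (B, D) from a (LoRA-tuned) image encoder and
    one text-prompt embedding per class (C, D) from the text encoder.
    """

    def __init__(self, dim: int, num_classes: int, temperature: float = 0.07):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)
        self.temperature = temperature

    def forward(self, image_emb, prompt_emb, labels):
        ce = F.cross_entropy(self.classifier(image_emb), labels)
        img = F.normalize(image_emb, dim=-1)
        txt = F.normalize(prompt_emb, dim=-1)
        logits = img @ txt.T / self.temperature          # (B, C) similarity to class prompts
        contrastive = F.cross_entropy(logits, labels)    # align each image with its class prompt
        return ce + contrastive

# Toy usage with random features.
head = JointHead(dim=512, num_classes=7)
loss = head(torch.randn(8, 512), torch.randn(7, 512), torch.randint(0, 7, (8,)))
loss.backward()
print(float(loss))
```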
@inproceedings{Nguyen2025MultiLevelCLS, title = {Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification}, author = {Nguyen, Y Hop and Huu, Doan Anh Phan and Tran, Trung Thai and Mai, Nhat Nam and Giap, Van Toi and Dao, Thao Thi Phuong and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Challenge)}, }
- ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization. Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, and 1 more author. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Challenge)
Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap’s effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains.
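The patch-level mutual nearest neighbor re-ranking step can be sketched as below: a candidate is scored by the average similarity of patch pairs that pick each other as nearest neighbors. The exact formulation in ReCap may differ; this assumes precomputed, L2-normalized patch embeddings (e.g., from DINOv2).

```python
import torch

def mutual_nn_score(query_patches: torch.Tensor, cand_patches: torch.Tensor) -> float:
    """Score a candidate by its mutually nearest patch pairs.

    query_patches: (Nq, D), cand_patches: (Nc, D) L2-normalised patch embeddings.
    A pair (i, j) counts only if j is i's nearest candidate patch AND i is j's
    nearest query patch; the score averages those pair similarities.
    """
    sim = query_patches @ cand_patches.T                   # (Nq, Nc) cosine similarities
    nn_q = sim.argmax(dim=1)                               # best candidate patch for each query patch
    nn_c = sim.argmax(dim=0)                               # best query patch for each candidate patch
    mutual = nn_c[nn_q] == torch.arange(sim.shape[0])      # i -> j -> back to i
    if not mutual.any():
        return 0.0
    return sim[torch.arange(sim.shape[0])[mutual], nn_q[mutual]].mean().item()

# Toy usage with random normalised patches.
q = torch.nn.functional.normalize(torch.randn(196, 768), dim=-1)
c = torch.nn.functional.normalize(torch.randn(196, 768), dim=-1)
print(mutual_nn_score(q, c))
```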
@inproceedings{Nguyen2025ReCap, title = {ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization}, author = {Nguyen, Thinh-Phuc and Nguyen, Thanh-Hai and Dinh, Gia-Huy and Nguyen, Lam-Huy and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Challenge)}, }
- EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions. Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, and Trung-Nghia Le. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Challenge)
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding.
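Reciprocal Rank Fusion itself is a standard formula, score(d) = sum over rankers of 1 / (k + rank(d)); a minimal sketch over hypothetical ranked lists:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked candidate lists with RRF: score(d) = sum_r 1 / (k + rank_r(d)).

    rankings: iterable of lists of candidate ids, each ordered best-first.
    The constant k (60 is the value commonly used in the IR literature) damps
    the influence of any single ranker's top positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: three configurations that mostly, but not fully, agree.
runs = [["img_3", "img_1", "img_7"], ["img_1", "img_3", "img_2"], ["img_3", "img_2", "img_1"]]
print(reciprocal_rank_fusion(runs))
```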
@inproceedings{Vo2025EventRetriever, title = {EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions}, author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Challenge)}, }
- Streamlining Virtual KOL Generation Through Modular Generative AI Architecture. Tan-Hiep To, Duy-Khang Nguyen, Minh-Triet Tran, and Trung-Nghia Le. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Demo)
@inproceedings{To2025StreamliningKOL, title = {Streamlining Virtual KOL Generation Through Modular Generative AI Architecture}, author = {To, Tan-Hiep and Nguyen, Duy-Khang and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Demo)} }
- Advancing Fashion Design Through Intelligent Sketchpad Studio. Nhu-Binh Nguyen-Truc*, Nhu-Vinh Hoang*, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. In ACM International Conference on Multimedia (ACM MM), 2025. (A* Rank) (Demo)
@inproceedings{Nguyen2025Sketchpad, title = {Advancing Fashion Design Through Intelligent Sketchpad Studio}, author = {Nguyen-Truc, Nhu-Binh and Hoang, Nhu-Vinh and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM International Conference on Multimedia (ACM MM)}, year = {2025}, note = {(A* Rank) (Demo)} }
- Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation. Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu, and 2 more authors. In MICCAI Workshop on Computational Pathology with Multimodal Data (COMPAYL), 2025.
@inproceedings{Pham2025Histopathology, title = {Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation}, author = {Pham, Ha-Hieu and Vu, Nguyen Lan Vi and Nguyen, Thanh-Huy and Bagci, Ulas and Xu, Min and Le, Trung-Nghia and Pham, Huy-Hieu}, booktitle = {MICCAI Workshop on Computational Pathology with Multimodal Data (COMPAYL)}, year = {2025}, project_page = {https://github.com/hieuphamha19/CSDS} }
- SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. Dinh-Khoi Vo*, Van-Loc Nguyen*, Minh-Triet Tran, and Trung-Nghia Le. In International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2025.
Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
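The mask-cleanup step described above (keeping the largest connected component of a noisy segmentation mask) can be sketched with NumPy and SciPy; the surrounding retrieval pipeline and any thresholds are omitted.

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest connected foreground blob of a binary mask.

    mask: HxW array where non-zero pixels are foreground. Small disconnected
    speckles (segmentation noise) are dropped, which is the kind of cleanup the
    paper describes before shape-guided re-ranking.
    """
    labeled, num = ndimage.label(mask > 0)
    if num == 0:
        return np.zeros_like(mask, dtype=bool)
    sizes = np.bincount(labeled.ravel())[1:]     # pixel count per component (skip background 0)
    return labeled == (int(np.argmax(sizes)) + 1)

# Toy usage: a big square plus a stray noisy pixel.
m = np.zeros((64, 64), dtype=np.uint8)
m[10:40, 10:40] = 1
m[60, 60] = 1
clean = keep_largest_component(m)
print(clean.sum())  # 900: only the square survives
```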
@inproceedings{Vo2025SAMURAI, title = {SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification}, author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Multimedia Analysis and Pattern Recognition (MAPR)}, year = {2025}, }
- GenFlow: Interactive Modular System for Image Generation. Duc-Hung Nguyen*, Huu-Phuc Huynh*, Minh-Triet Tran, and Trung-Nghia Le. In International Conference on Content-Based Multimedia Indexing (CBMI), 2025.
Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow’s ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.
@inproceedings{Nguyen2025GenFlow, title = {GenFlow: Interactive Modular System for Image Generation}, author = {Nguyen, Duc-Hung and Huynh, Huu-Phuc and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Content-Based Multimedia Indexing (CBMI)}, year = {2025}, }
- Automated Image Recognition Framework. Quang-Binh Nguyen*, Trong-Vu Hoang*, Do Tran Ngoc, Tam V. Nguyen, Minh-Triet Tran, and 1 more author. In International Conference on Computational Collective Intelligence (ICCCI), 2025. (B Rank)
While the efficacy of deep learning models heavily relies on data, gathering and annotating data for specific tasks, particularly when addressing novel or sensitive subjects lacking relevant datasets, poses significant time and resource challenges. In response to this, we propose a novel Automated Image Recognition (AIR) framework that harnesses the power of generative AI. AIR empowers end-users to synthesize high-quality, pre-annotated datasets, eliminating the necessity for manual labeling. It also automatically trains deep learning models on the generated datasets with robust image recognition performance. Our framework includes two main data synthesis processes, AIR-Gen and AIR-Aug. The AIR-Gen enables end-users to seamlessly generate datasets tailored to their specifications. To improve image quality, we introduce a novel automated prompt engineering module that leverages the capabilities of large language models. We also introduce a distribution adjustment algorithm to eliminate duplicates and outliers, enhancing the robustness and reliability of generated datasets. On the other hand, the AIR-Aug enhances a given dataset, thereby improving the performance of deep classifier models. AIR-Aug is particularly beneficial when users have limited data for specific tasks. Through comprehensive experiments, we demonstrated the efficacy of our generated data in training deep learning models and showcased the system’s potential to provide image recognition models for a wide range of objects. We also conducted a user study that achieved an impressive score of 4.4 out of 5.0, underscoring the AI community’s positive perception of AIR.
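The abstract does not specify the distribution adjustment algorithm, but one plausible realization of duplicate and outlier removal over generated-image embeddings is a simple cosine-similarity filter, sketched below purely as an illustration.

```python
import numpy as np

def filter_generated_embeddings(emb: np.ndarray, dup_thresh: float = 0.98, out_thresh: float = 0.30):
    """Drop near-duplicate and outlier samples from a batch of generated-image embeddings.

    emb: (N, D) feature vectors (e.g. from an image encoder). A sample is treated as a
    duplicate if its cosine similarity to an already-kept sample exceeds dup_thresh, and
    as an outlier if its similarity to the mean embedding falls below out_thresh. This is
    only one plausible stand-in for the paper's distribution adjustment step.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    kept = []
    for i, v in enumerate(emb):
        if v @ centroid < out_thresh:                       # too far from the batch distribution
            continue
        if any(v @ emb[j] > dup_thresh for j in kept):      # too close to something already kept
            continue
        kept.append(i)
    return kept

# Toy usage: correlated samples plus noise so some survive both filters.
base = np.random.randn(512)
samples = base + 0.3 * np.random.randn(100, 512)
print(len(filter_generated_embeddings(samples)))
```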
@inproceedings{Nguyen2025AIR, title = {Automated Image Recognition Framework}, author = {Nguyen, Quang-Binh and Hoang, Trong-Vu and Ngoc, Do Tran and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Computational Collective Intelligence (ICCCI)}, year = {2025}, note = {(B Rank)}, }
- Chat2Edit: A Prompt-based Image Editor with Live Feedback and Parameter Recommendation. Tin-Nghia Le, Phuong-Dao Duong Dinh, Quang Huy Che, Duc-Vu Nguyen, Vinh-Tiep Nguyen, and 3 more authors. In International Conference on Computational Collective Intelligence (ICCCI), 2025. (B Rank)
@inproceedings{Le2025Chat2Edit, title = {Chat2Edit: A Prompt-based Image Editor with Live Feedback and Parameter Recommendation}, author = {Le, Tin-Nghia and Dinh, Phuong-Dao Duong and Che, Quang Huy and Nguyen, Duc-Vu and Nguyen, Vinh-Tiep and Nguyen, Tam V. and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {International Conference on Computational Collective Intelligence (ICCCI)}, year = {2025}, note = {(B Rank)} }
- FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement. Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, and 2 more authors. In International Conference on Computational Collective Intelligence (ICCCI), 2025. (B Rank)
Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: Concept Fusion technique and Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept’s attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
@inproceedings{Tran2025FaR, title = {FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement}, author = {Tran, Gia-Nghia and Che, Quang-Huy and Vu, Trong-Tai Dam and Pham, Bich-Nga and Nguyen, Vinh-Tiep and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {International Conference on Computational Collective Intelligence (ICCCI)}, year = {2025}, note = {(B Rank)}, }
- CamoFA: A Learnable Fourier-based Augmentation for Camouflage Segmentation. Minh-Quan Le, Minh-Triet Tran, Trung-Nghia Le, Tam V. Nguyen, and Thanh-Toan Do. In Winter Conference on Applications of Computer Vision (WACV), 2025. (A Rank)
Camouflaged object detection (COD) and camouflaged instance segmentation (CIS) aim to recognize and segment objects that are blended into their surroundings, respectively. While several deep neural network models have been proposed to tackle those tasks, augmentation methods for COD and CIS have not been thoroughly explored. Augmentation strategies can help improve models’ performance by increasing the size and diversity of the training data and exposing the model to a wider range of variations in the data. Besides, we aim to automatically learn transformations that help to reveal the underlying structure of camouflaged objects and allow the model to learn to better identify and segment camouflaged objects. To achieve this, we propose a learnable augmentation method in the frequency domain for COD and CIS via the Fourier transform approach, dubbed CamoFA. Our method leverages a conditional generative adversarial network and cross-attention mechanism to generate a reference image and an adaptive hybrid swapping with parameters to mix the low-frequency component of the reference image and the high-frequency component of the input image. This approach aims to make camouflaged objects more visible for detection and segmentation models. Without bells and whistles, our proposed augmentation method boosts the performance of camouflaged object detectors and instance segmenters by large margins.
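The frequency-domain mixing idea, combining the low-frequency component of a reference image with the high-frequency component of the input, can be sketched with a fixed NumPy FFT swap. CamoFA learns the reference image and the swapping parameters; this sketch hard-codes both, so it is only an illustration of the underlying operation.

```python
import numpy as np

def fourier_hybrid_swap(input_img: np.ndarray, reference_img: np.ndarray, beta: float = 0.1):
    """Replace the low-frequency band of the input with that of the reference.

    Both images are HxWxC float arrays in [0, 1]. beta sets the half-width of the
    low-frequency square (as a fraction of image size). In CamoFA this band and the
    blending are learned; here the swap is fixed, purely for illustration.
    """
    h, w = input_img.shape[:2]
    fft_in = np.fft.fftshift(np.fft.fft2(input_img, axes=(0, 1)), axes=(0, 1))
    fft_ref = np.fft.fftshift(np.fft.fft2(reference_img, axes=(0, 1)), axes=(0, 1))
    b_h, b_w = int(beta * h), int(beta * w)
    cy, cx = h // 2, w // 2
    # Low-frequency coefficients sit around the centre after fftshift.
    fft_in[cy - b_h:cy + b_h, cx - b_w:cx + b_w] = fft_ref[cy - b_h:cy + b_h, cx - b_w:cx + b_w]
    mixed = np.fft.ifft2(np.fft.ifftshift(fft_in, axes=(0, 1)), axes=(0, 1)).real
    return mixed.clip(0.0, 1.0)

a = np.random.rand(128, 128, 3)
b = np.random.rand(128, 128, 3)
print(fourier_hybrid_swap(a, b).shape)
```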
@inproceedings{Le2025CamoFA, title = {CamoFA: A Learnable Fourier-based Augmentation for Camouflage Segmentation}, author = {Le, Minh-Quan and Tran, Minh-Triet and Le, Trung-Nghia and Nguyen, Tam V. and Do, Thanh-Toan}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2025}, note = {(A Rank)}, }
- Comprehensive Analysis of AI-Synthetic Image Detection Architectures. Thien-Hoa Hoang-Don, Tien-Dat Nguyen, Nam-Anh Nguyen, and Trung-Nghia Le. In National Conference on Fundamental and Applied IT Research (FAIR), 2025.
@inproceedings{HoangDon2025SyntheticImageDetection, title = {Comprehensive Analysis of AI-Synthetic Image Detection Architectures}, author = {Hoang-Don, Thien-Hoa and Nguyen, Tien-Dat and Nguyen, Nam-Anh and Le, Trung-Nghia}, booktitle = {National Conference on Fundamental and Applied IT Research (FAIR)}, year = {2025} }
- GenKOL: Modular Generative AI Framework For Scalable Virtual KOL Generation. Tan-Hiep To, Duy-Khang Nguyen, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. arXiv preprint arXiv:2509.14927, 2025.
Key Opinion Leaders (KOLs) play a crucial role in modern marketing by shaping consumer perceptions and enhancing brand credibility. However, collaborating with human KOLs often involves high costs and logistical challenges. To address this, we present GenKOL, an interactive system that empowers marketing professionals to efficiently generate high-quality virtual KOL images using generative AI. GenKOL enables users to dynamically compose promotional visuals through an intuitive interface that integrates multiple AI capabilities, including garment generation, makeup transfer, background synthesis, and hair editing. These capabilities are implemented as modular, interchangeable services that can be deployed flexibly on local machines or in the cloud. This modular architecture ensures adaptability across diverse use cases and computational environments. Our system can significantly streamline the production of branded content, lowering costs and accelerating marketing workflows through scalable virtual KOL creation.
@article{To2025GenKOL, title = {GenKOL: Modular Generative AI Framework For Scalable Virtual KOL Generation}, author = {To, Tan-Hiep and Nguyen, Duy-Khang and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, journal = {arXiv preprint arXiv:2509.14927}, year = {2025}, }
- KiseKloset: Comprehensive System For Outfit Retrieval, Recommendation, And Try-On. Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. arXiv preprint arXiv:2506.23471, 2025.
The global fashion e-commerce industry has become integral to people’s daily lives, leveraging technological advancements to offer personalized shopping experiences, primarily through recommendation systems that enhance customer engagement through personalized suggestions. To improve customers’ experience in online shopping, we propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We explore two approaches for outfit retrieval: similar item retrieval and text feedback-guided item retrieval. Notably, we introduce a novel transformer architecture designed to recommend complementary items from diverse categories. Furthermore, we enhance the overall performance of the search pipeline by integrating approximate algorithms to optimize the search process. Additionally, addressing the crucial needs of online shoppers, we employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs compared to its predecessors. This virtual try-on module empowers users to visualize specific garments on themselves, enhancing the customers’ experience and reducing costs associated with damaged items for retailers. We deployed our end-to-end system for online users to test and provide feedback, enabling us to measure their satisfaction levels. The results of our user study revealed that 84% of participants found our comprehensive system highly useful, significantly improving their online shopping experience.
- Interactive Interface For Semantic Segmentation Dataset Synthesis. Ngoc-Do Tran, Minh-Tuan Huynh, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. arXiv preprint arXiv:2506.23470, 2025.
The rapid advancement of AI and computer vision has significantly increased the demand for high-quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource-intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real-world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user-friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user-friendly interface allows users to quickly customize their data pipelines through drag-and-drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels, have demonstrated flexible usage, and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real-world applications.
- PrefPaint: Enhancing Image Inpainting through Expert Human Feedback. Duy-Bao Bui, Hoang-Khang Nguyen, and Trung-Nghia Le. arXiv preprint arXiv:2506.21834, 2025.
Inpainting, the process of filling missing or corrupted image parts, has broad applications, including medical imaging. However, in specialized fields like medical polyp imaging, where accuracy and reliability are critical, inpainting models can generate inaccurate images, leading to significant errors in medical diagnosis and treatment. To ensure reliability, medical images should be annotated by experts such as oncologists for effective model training. We propose PrefPaint, an approach that incorporates human feedback into the training process of Stable Diffusion Inpainting, bypassing the need for computationally expensive reward models. In addition, we develop a web-based interface that streamlines training, fine-tuning, and inference. This interactive interface provides a smooth and intuitive user experience, making it easier to offer feedback and manage the fine-tuning process. A user study on various domains shows that PrefPaint outperforms existing methods, reducing visual inconsistencies and improving image rendering, particularly in medical contexts, where our model generates more realistic polyp images.
- TaleForge: Interactive Multimodal System for Personalized Story Creation. Minh-Loi Nguyen, Quang-Khai Le, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. arXiv preprint arXiv:2506.21832, 2025.
Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users’ facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users’ faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system’s real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.
- VisionGuard: Synergistic Framework for Helmet Violation Detection. Thinh-Phuc Nguyen*, Thanh-Hai Nguyen*, Gia-Huy Dinh*, Lam-Huy Nguyen*, Minh-Triet Tran, and 1 more author. arXiv preprint arXiv:2506.21005, 2025.
Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
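A simplified stand-in for the tracking-based label refinement idea is a per-track majority vote over frame-level class predictions, sketched below with hypothetical detection records; the actual Adaptive Labeling module and the Contextual Expander are not reproduced here.

```python
from collections import Counter, defaultdict

def refine_labels_by_track(detections):
    """Majority-vote the class of each track and overwrite per-frame predictions.

    detections: list of dicts with 'frame', 'track_id', 'cls' (and normally a box).
    Frame-level detectors often flicker between 'helmet' and 'no_helmet' for the same
    rider; voting over the whole track keeps the label temporally consistent.
    This is a simplified stand-in for the paper's Adaptive Labeling module.
    """
    per_track = defaultdict(list)
    for det in detections:
        per_track[det["track_id"]].append(det["cls"])
    majority = {tid: Counter(classes).most_common(1)[0][0] for tid, classes in per_track.items()}
    return [{**det, "cls": majority[det["track_id"]]} for det in detections]

dets = [
    {"frame": 0, "track_id": 1, "cls": "no_helmet"},
    {"frame": 1, "track_id": 1, "cls": "helmet"},
    {"frame": 2, "track_id": 1, "cls": "helmet"},
]
print(refine_labels_by_track(dets))  # all three frames become 'helmet'
```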
- Shape2Animal: Creative Animal Generation from Natural Silhouettes. Quoc-Duy Tran, Anh-Tuan Vo, Dinh-Khoi Vo, Tam V. Nguyen, Minh-Triet Tran, and 1 more author. arXiv preprint arXiv:2506.20616, 2025.
Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework, which mimics this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract the object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design.
- ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation. Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, and 1 more author. arXiv preprint arXiv:2506.18493, 2025.
Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as the plug-and-play module. Extensive experiments and user studies validate ShowFlow’s effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.
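The Kronecker-adapter core behind KronA-WED can be sketched as a frozen linear layer plus a Kronecker-factored update dW = A ⊗ B; the weight and embedding decomposition and the attention regularization are not shown, and the factor sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class KroneckerAdapter(nn.Module):
    """Add a Kronecker-factored update dW = A ⊗ B on top of a frozen linear layer.

    The frozen weight has shape (out, in); A is (out // bo, in // bi) and B is (bo, bi),
    so torch.kron(A, B) matches the frozen weight. This is only the Kronecker-adapter
    core; ShowFlow's KronA-WED additionally decomposes weights and embeddings.
    """

    def __init__(self, base: nn.Linear, bo: int = 8, bi: int = 8, scale: float = 1.0):
        super().__init__()
        out_f, in_f = base.weight.shape
        assert out_f % bo == 0 and in_f % bi == 0
        self.base = base
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(out_f // bo, in_f // bi))  # zero-init: no change at start
        self.B = nn.Parameter(torch.randn(bo, bi) * 0.01)
        self.scale = scale

    def forward(self, x):
        delta = torch.kron(self.A, self.B)           # (out, in) low-parameter update
        return self.base(x) + self.scale * (x @ delta.T)

layer = KroneckerAdapter(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)
```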
- CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing. Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, and Trung-Nghia Le. arXiv preprint arXiv:2506.18438, 2025.
Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects’ shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.
2024
- Artificial Intelligence for Laryngoscopy in Vocal Fold Diseases: A Review of Dataset, Technology, and Ethics. Thao Thi Phuong Dao, Tan-Cong Nguyen, Viet-Tham Huynh, Xuan-Hai Bui, Trung-Nghia Le, and 1 more author. Machine Learning, 2024. (Q1, IF = 4.3 in 2023) (ACML 2024, Journal track)
Laryngoscopy plays a crucial role in providing essential visual access to the larynx, especially vocal folds, for diagnosis and treatment interventions. The field of laryngoscopy is witnessing remarkable advancements driven by artificial intelligence (AI) and deep learning, particularly in diagnosing vocal fold disorders. This paper delves into a comprehensive analysis of diverse publicly available laryngoscopy image datasets and cutting-edge deep learning techniques, demonstrating their immense potential to revolutionize diagnostic accuracy and efficiency. However, the ethical and legal challenges surrounding AI in healthcare cannot be overlooked. We meticulously examine critical considerations such as dataset collection, algorithm bias, and responsible clinical application. By addressing these concerns, we emphasize the pivotal role AI can play while ensuring fairness, trust, and adherence to medical ethics. Our aim is to foster a comprehensive understanding of both the potential and the ethical considerations for implementing AI in laryngoscopy. This responsible approach will ultimately lead to improved patient outcomes and a stronger foundation for medical ethics in the age of AI.
@article{Le2024AILaryngoscopy, title = {Artificial Intelligence for Laryngoscopy in Vocal Fold Diseases: A Review of Dataset, Technology, and Ethics}, author = {Dao, Thao Thi Phuong and Nguyen, Tan-Cong and Huynh, Viet-Tham and Bui, Xuan-Hai and Le, Trung-Nghia and Tran, Minh-Triet}, journal = {Machine Learning}, year = {2024}, note = {(Q1, IF = 4.3 in 2023) (ACML 2024, Journal track)}, }
- Improving Laryngoscopy Image Analysis through Integration of Global Information and Local Features in VoFoCD Dataset. Thao Thi Phuong Dao, Tuan-Luc Huynh, Minh-Khoi Pham, Trung-Nghia Le, Tan-Cong Nguyen, and 5 more authors. Imaging Informatics in Medicine, 2024. (Q1, IF = 4.4 in 2022)
The diagnosis and treatment of vocal fold disorders heavily rely on the use of laryngoscopy. A comprehensive vocal fold diagnosis requires accurate identification of crucial anatomical structures and potential lesions during laryngoscopy observation. However, existing approaches have yet to explore the joint optimization of the decision-making process, including object detection and image classification tasks simultaneously. In this study, we provide a new dataset, VoFoCD, with 1724 laryngology images designed explicitly for object detection and image classification in laryngoscopy images. Images in the VoFoCD dataset are categorized into four classes and comprise six glottic object types. Moreover, we propose a novel Multitask Efficient trAnsformer network for Laryngoscopy (MEAL) to classify vocal fold images and detect glottic landmarks and lesions. To further facilitate interpretability for clinicians, MEAL provides attention maps to visualize important learned regions for explainable artificial intelligence results toward supporting clinical decision-making. We also analyze our model’s effectiveness in simulated clinical scenarios where shaking of the laryngoscopy process occurs. The proposed model demonstrates outstanding performance on our VoFoCD dataset. The accuracy for image classification and mean average precision at an intersection over a union threshold of 0.5 (mAP50) for object detection are 0.951 and 0.874, respectively. Our MEAL method integrates global knowledge, encompassing general laryngoscopy image classification, into local features, which refer to distinct anatomical regions of the vocal fold, particularly abnormal regions, including benign and malignant lesions. Our contribution can effectively aid laryngologists in identifying benign or malignant lesions of vocal folds and classifying images in the laryngeal endoscopy process visually.
@article{Le2024VoFoCD, title = {Improving Laryngoscopy Image Analysis through Integration of Global Information and Local Features in VoFoCD Dataset}, author = {Dao, Thao Thi Phuong and Huynh, Tuan-Luc and Pham, Minh-Khoi and Le, Trung-Nghia and Nguyen, Tan-Cong and Nguyen, Quang-Thuc and Tran, Bich Anh and Van, Boi Ngoc and Ha, Chanh Cong and Tran, Minh-Triet}, journal = {Imaging Informatics in Medicine}, year = {2024}, note = {(Q1, IF = 4.4 in 2022)}, } - IEEE AccesseKYC-DF: A Large-Scale Deepfake Dataset for Developing and Evaluating eKYC SystemsHichem Felouat, Huy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, and Isao EchizenIEEE Access, 2024(Q1, IF = 3.9 in 2022)
The reliability of remote identity-proofing systems (i.e., electronic Know Your Customer, or eKYC, systems) is challenged by the development of deepfake generation tools, which can be used to create fake videos that are difficult to detect using existing deepfake detection models and are indistinguishable by facial recognition systems. This poses a serious threat to eKYC systems and a danger to individuals’ personal information and property. Existing deepfake datasets are not particularly appropriate for developing and evaluating eKYC systems, which require specific motions, such as head movement, for liveness detection. Furthermore, they do not contain ID information or protocols for facial verification evaluation, which is vital for eKYC. We found that eKYC systems without the ability to detect deepfakes can be easily compromised. We have thus created a large-scale collection of high-quality fake videos (more than 228,000 videos) that are diverse in terms of age, gender, and ethnicity, plus a corresponding facial image subset. The videos include a variety of head movements and facial expressions. This large collection of high-quality diverse videos is well-suited for developing and evaluating various tasks related to eKYC systems. Furthermore, we provide protocols for traditional deepfake detection and facial verification, which are widely used in eKYC systems. It is worth mentioning that systematic evaluation of facial recognition systems on deepfake detection has not been reported.
@article{Le2024eKYCDF, title = {eKYC-DF: A Large-Scale Deepfake Dataset for Developing and Evaluating eKYC Systems}, author = {Felouat, Hichem and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, journal = {IEEE Access}, year = {2024}, note = {(Q1, IF = 3.9 in 2022)}, } - IEEE AccessAnalysis of Fine-grained Counting Methods for Masked Face Counting: A Comparative StudyKhanh-Duy Nguyen, Huy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, and Isao EchizenIEEE Access, 2024(Q1, IF = 3.9 in 2022)
Masked face counting involves counting faces at various crowd densities and discriminating between masked and unmasked faces; it is generally treated as an object (i.e., face) detection task. Counting accuracy is limited, especially at higher densities, when the faces are relatively small, unclear, and viewed at various angles. Furthermore, it is costly to create the ground-truth bounding boxes needed to train object detection methods. We formulate masked face detection as a fine-grained crowd-counting task, which is appropriate for tackling this challenging task when used with density map regression. However, adopting fine-grained crowd-counting methods for masked face counting is not trivial. It is necessary to identify strategies appropriate for both counting and multi-class classification. We contrasted the strategies of various approaches and examined their benefits and drawbacks. These strategies include (1) simple regression with mixed regression and detection for counting, (2) using class-aware density maps with semantic segmentation maps and class probabilities for classification, and (3) counting with or without depth information enhancement. Analysis of seven crowd-counting methods on three datasets with a total of about 900k annotations demonstrated that the level of congestion affects how well simple regression and mixed regression and detection work for counting. Meanwhile, the most effective approach for classification is using semantic segmentation maps. Evaluation of the usefulness of using depth data demonstrated the need for a depth map to achieve accurate counting. These findings should be useful for future studies.
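A minimal sketch of the density-map counting idea discussed in this abstract: each class-aware density channel integrates to a count, so masked and unmasked faces can be counted by summing channels. This is an illustrative example only; the function name and array layout are assumptions, not the paper's code.

```python
# Illustrative only: counting masked vs. unmasked faces by integrating
# class-aware density maps (array layout is an assumption, not the paper's code).
import numpy as np

def count_from_density_maps(density: np.ndarray) -> dict:
    """density: (C, H, W) array; channel 0 = masked faces, channel 1 = unmasked faces.
    Each channel integrates (sums) to the predicted count for that class."""
    counts = density.reshape(density.shape[0], -1).sum(axis=1)
    return {"masked": float(counts[0]), "unmasked": float(counts[1]),
            "total": float(counts.sum())}

if __name__ == "__main__":
    dens = np.zeros((2, 64, 64))
    dens[0, 10:20, 10:20] = 1.0 / 100   # one masked face spread over 100 pixels
    dens[0, 40:50, 40:50] = 1.0 / 100   # a second masked face
    dens[1, 25:35, 25:35] = 1.0 / 100   # one unmasked face
    print(count_from_density_maps(dens))  # ~{'masked': 2.0, 'unmasked': 1.0, 'total': 3.0}
```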
@article{Le2024MaskedFaceCounting, title = {Analysis of Fine-grained Counting Methods for Masked Face Counting: A Comparative Study}, author = {Nguyen, Khanh-Duy and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, journal = {IEEE Access}, year = {2024}, note = {(Q1, IF = 3.9 in 2022)}, } - SoICTLanguage-Guided Video Object SegmentationMinh Duy Phan, Minh Huan Le, Minh-Triet Tran, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Phan2024LanguageVOS, title = {Language-Guided Video Object Segmentation}, author = {Phan, Minh Duy and Le, Minh Huan and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTVisChronos: Revolutionizing Image Captioning Through Real-Life EventsPhuc-Tan Nguyen*, Hieu Nguyen*, Minh-Triet Tran, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Nguyen2024VisChronos, title = {VisChronos: Revolutionizing Image Captioning Through Real-Life Events}, author = {Nguyen, Phuc-Tan and Nguyen, Hieu and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)}, dataset = {https://zenodo.org/records/14004909} } - SoICTEPEdit: Redefining Image Editing with Generative AI and User-Centric DesignHoang-Phuc Nguyen*, Dinh-Khoi Vo*, Trong-Le Do, Hai-Dang Nguyen, Tan-Cong Nguyen, and 5 more authorsIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Nguyen2024EPEdit, title = {EPEdit: Redefining Image Editing with Generative AI and User-Centric Design}, author = {Nguyen, Hoang-Phuc and Vo, Dinh-Khoi and Do, Trong-Le and Nguyen, Hai-Dang and Nguyen, Tan-Cong and Nguyen, Vinh-Tiep and Nguyen, Tam V. and Le, Khanh-Duy and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTMythraGen: Two-Stage Retrieval Augmented Art Generation FrameworkQuang-Khai Le*, Cong-Long Nguyen*, Minh-Triet Tran, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Le2024MythraGen, title = {MythraGen: Two-Stage Retrieval Augmented Art Generation Framework}, author = {Le, Quang-Khai and Nguyen, Cong-Long and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTKidRisk: Benchmark Dataset for Children Dangerous Action RecognitionMinh-Kha Nguyen*, Trung-Hieu Do*, Kim Anh Phung, Thao Thi Phuong Dao, Minh-Triet Tran, and 1 more authorIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Nguyen2024KidRisk, title = {KidRisk: Benchmark Dataset for Children Dangerous Action Recognition}, author = {Nguyen, Minh-Kha and Do, Trung-Hieu and Phung, Kim Anh and Dao, Thao Thi Phuong and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTDanceDuo: Bridging Human Movement and AI ChoreographyGia-Cat Bui-Le, Tuong-Vy Truong-Thuy, Hai-Dang Nguyen, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{BuiLe2024DanceDuo, title = {DanceDuo: Bridging Human Movement and AI Choreography}, author = {Bui-Le, Gia-Cat and Truong-Thuy, Tuong-Vy and Nguyen, Hai-Dang and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTBudget-Aware Keyboardless InteractionQuang-Thang Nguyen*, Gia-Phuc Song-Dong*, Minh-Triet Tran, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024(Oral)
@inproceedings{Nguyen2024Keyboardless, title = {Budget-Aware Keyboardless Interaction}, author = {Nguyen, Quang-Thang and Song-Dong, Gia-Phuc and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024}, note = {(Oral)} } - SoICTDecoding Deepfakes: Caption Guided Learning for Robust Deepfake DetectionY-Hop Nguyen, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024
@inproceedings{Nguyen2024DeepfakeCaption, title = {Decoding Deepfakes: Caption Guided Learning for Robust Deepfake Detection}, author = {Nguyen, Y-Hop and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024} } - SoICTMinimalist Preprocessing Approach for Image Synthesis DetectionHoai-Danh Vo, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024
@inproceedings{Vo2024SynthesisDetection, title = {Minimalist Preprocessing Approach for Image Synthesis Detection}, author = {Vo, Hoai-Danh and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024} } - SoICTHybrid Compression: Integrating Pruning and Quantization for Optimized Neural NetworksMinh-Loi Nguyen*, Long-Bao Nguyen*, Van-Hieu Huynh*, Minh-Triet Tran, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2024
@inproceedings{Nguyen2024HybridCompression, title = {Hybrid Compression: Integrating Pruning and Quantization for Optimized Neural Networks}, author = {Nguyen, Minh-Loi and Nguyen, Long-Bao and Huynh, Van-Hieu and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024} } - SoICTMotion Analysis in Static ImagesKunal Agrawal, Vatsa Patel, Reema Tharra, Trung-Nghia Le, Minh-Triet Tran, and 1 more authorIn International Symposium on Information and Communication Technology (SoICT), 2024
@inproceedings{Agrawal2024MotionAnalysis, title = {Motion Analysis in Static Images}, author = {Agrawal, Kunal and Patel, Vatsa and Tharra, Reema and Le, Trung-Nghia and Tran, Minh-Triet and Nguyen, Tam V.}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024} } - SoICTAI-Generated Image Recognition via Fusion of CNNs and Vision TransformersXuan-Bach Mai, Hoang-Tung Vu, Hoang-Minh Nguyen-Huu, Quoc-Nghia Nguyen, Minh-Triet Tran, and 1 more authorIn International Symposium on Information and Communication Technology (SoICT), 2024
@inproceedings{Mai2024FusionDetection, title = {AI-Generated Image Recognition via Fusion of CNNs and Vision Transformers}, author = {Mai, Xuan-Bach and Vu, Hoang-Tung and Nguyen-Huu, Hoang-Minh and Nguyen, Quoc-Nghia and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2024} } - ACCVRethinking Sampling for Music-Driven Long-Term Dance GenerationTuong-Vy Truong-Thuy, Gia-Cat Bui-Le, Hai-Dang Nguyen, and Trung-Nghia LeIn Asian Conference on Computer Vision (ACCV), 2024(B Rank)
Generating dance sequences that synchronize with music while maintaining naturalness and realism is a challenging task. Existing methods often suffer from freezing phenomena or abrupt transitions. In this work, we introduce DanceFusion, a conditional diffusion model designed to address the complexities of creating long-term dance sequences. Our method employs a past and future-conditioned diffusion model, leveraging the attention mechanism to learn the dependencies among music, past, and future motions. We also propose a novel sampling method that completes the transitional motions between two dance segments by treating previous and upcoming motions as conditions. Additionally, we address abruptness in dance sequences by incorporating inpainting strategies into a part of the sampling process, thereby improving the smoothness and naturalness of motion generation. Experimental results demonstrate that DanceFusion outperforms state-of-the-art methods in generating high-quality and diverse dance motions. User study results further validate the effectiveness of our approach in generating long dance sequences, with participants consistently rating DanceFusion higher across all key metrics. Code and model are available at https://github.com/trgvy23/DanceFusion.
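The sampling idea described above can be illustrated with a generic inpainting-style diffusion loop in which the known past and future motion is re-imposed at every denoising step. The sketch below is a simplification under that assumption; the `denoise_step` callable and the linear noise schedule are placeholders, not DanceFusion's implementation.

```python
# Simplified sketch, not the DanceFusion code: inpainting-style sampling where
# past/future frames are re-imposed each step so only transitional frames are generated.
import torch

def inpaint_sample(denoise_step, known_motion, known_mask, T=50):
    """known_motion: (frames, dims) tensor with past/future frames filled in.
    known_mask: 1 where motion is given (past/future), 0 where it must be generated."""
    x = torch.randn_like(known_motion)                    # start from pure noise
    for t in reversed(range(T)):
        noise_level = (t + 1) / T                         # placeholder linear schedule
        noised_known = known_motion + noise_level * torch.randn_like(known_motion)
        x = known_mask * noised_known + (1 - known_mask) * x
        x = denoise_step(x, t)                            # one reverse-diffusion step
    return known_mask * known_motion + (1 - known_mask) * x

if __name__ == "__main__":
    dummy_denoiser = lambda x, t: 0.9 * x                 # stand-in for a trained model
    motion = torch.zeros(60, 72)                          # 60 frames of 72-D pose features
    mask = torch.zeros(60, 1); mask[:20] = 1; mask[-20:] = 1
    print(inpaint_sample(dummy_denoiser, motion, mask).shape)  # torch.Size([60, 72])
```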
@inproceedings{TruongThuy2024DanceGeneration, title = {Rethinking Sampling for Music-Driven Long-Term Dance Generation}, author = {Truong-Thuy, Tuong-Vy and Bui-Le, Gia-Cat and Nguyen, Hai-Dang and Le, Trung-Nghia}, booktitle = {Asian Conference on Computer Vision (ACCV)}, year = {2024}, note = {(B Rank)}, project_page = {https://github.com/trgvy23/DanceFusion} } - ACCVCrossPAR: Enhancing Pedestrian Attribute Recognition with Vision-Language Fusion and Human-Centric Pre-trainingBach-Hoang Ngo, Si-Tri Ngo, Phu-Duc Le, Quang-Minh Phan, Minh-Triet Tran, and 1 more authorIn Asian Conference on Computer Vision (ACCV), 2024(B Rank)
Pedestrian attribute recognition (PAR) is crucial in various applications like surveillance and urban planning. Accurately identifying attributes in diverse and intricate urban environments is challenging despite its significance. This paper introduces a novel network for PAR that integrates a human-centric encoder, trained on extensive human datasets, with a vision-language encoder, trained on substantial text-image pair datasets. We also develop a cross-attention mechanism utilizing a Mixture-of-Experts approach that combines the human-centric encoder’s proficiency in local attribute detection with the vision-language encoder’s ability to comprehend global content. CrossPAR achieves accuracy comparable to existing techniques across multiple benchmarks while using less training data. These results confirm our approach’s effectiveness and suggest promising avenues for further research and practical applications in the domain of PAR and related fields.
@inproceedings{Ngo2024CrossPAR, title = {CrossPAR: Enhancing Pedestrian Attribute Recognition with Vision-Language Fusion and Human-Centric Pre-training}, author = {Ngo, Bach-Hoang and Ngo, Si-Tri and Le, Phu-Duc and Phan, Quang-Minh and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {Asian Conference on Computer Vision (ACCV)}, year = {2024}, note = {(B Rank)}, } - ISMARImmersive Spatiotemporal Travel in Virtual RealityThanh Ngoc-Dat Tran, Viet-Tham Huynh, Poojitha Moganti, Trung-Nghia Le, Minh-Triet Tran, and 1 more authorIn International Symposium on Mixed and Augmented Reality (ISMAR), 2024(A* Rank, Poster)
@inproceedings{Tran2024ImmersiveVR, title = {Immersive Spatiotemporal Travel in Virtual Reality}, author = {Tran, Thanh Ngoc-Dat and Huynh, Viet-Tham and Moganti, Poojitha and Le, Trung-Nghia and Tran, Minh-Triet and Nguyen, Tam V.}, booktitle = {International Symposium on Mixed and Augmented Reality (ISMAR)}, year = {2024}, note = {(A* Rank, Poster)} } - ISMARUrban Traffic Planning Simulation with Time and Weather DynamicsTam V. Nguyen, Thanh Ngoc-Dat Tran, Viet-Tham Huynh, Vatsa S Patel, Umang Jai, and 3 more authorsIn International Symposium on Mixed and Augmented Reality (ISMAR), 2024(A* Rank, Poster)
@inproceedings{Nguyen2024UrbanSim, title = {Urban Traffic Planning Simulation with Time and Weather Dynamics}, author = {Nguyen, Tam V. and Tran, Thanh Ngoc-Dat and Huynh, Viet-Tham and Patel, Vatsa S and Jai, Umang and Tran, Mai-Khiem and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {International Symposium on Mixed and Augmented Reality (ISMAR)}, year = {2024}, note = {(A* Rank, Poster)} } - CVPRWSynthetic Is All You Need For Semantic SegmentationMinh-Tuan Huynh*, Ngoc-Do Tran*, Minh-Triet Tran, and Trung-Nghia LeIn SyntaGen Workshop, CVPR, 2024(First Prize)
@inproceedings{Huynh2024SyntheticSeg, title = {Synthetic Is All You Need For Semantic Segmentation}, author = {Huynh, Minh-Tuan and Tran, Ngoc-Do and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {SyntaGen Workshop, CVPR}, year = {2024}, note = {(First Prize)}, invited_paper = {https://syntagen.github.io/#syntagen-competition}, project_page = {https://github.com/synth-e/Syntagen-Solution} } - MAPRRethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene RecognitionTrong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran, and 1 more authorIn International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2024
@inproceedings{Hoang2024SemanticAugmentation, title = {Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition}, author = {Hoang, Trong-Vu and Nguyen, Quang-Binh and Vo, Dinh-Khoi and Vo, Hoai-Danh and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Multimedia Analysis and Pattern Recognition (MAPR)}, year = {2024} } - MAPREvaluation of Image Matching for Art Skills AssessmentAsaad Alghamdi, Michael Poor, Trung-Nghia Le, and Tam V. NguyenIn International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2024
@inproceedings{Alghamdi2024ArtAssessment, title = {Evaluation of Image Matching for Art Skills Assessment}, author = {Alghamdi, Asaad and Poor, Michael and Le, Trung-Nghia and Nguyen, Tam V.}, booktitle = {International Conference on Multimedia Analysis and Pattern Recognition (MAPR)}, year = {2024} } - MAPRMasked Face Recognition on Limited Training DataPhuoc-Sang Pham, Minh-Kha Nguyen, Minh-Hien Le, Minh-Triet Tran, and Trung-Nghia LeIn International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2024
@inproceedings{Pham2024MaskedFace, title = {Masked Face Recognition on Limited Training Data}, author = {Pham, Phuoc-Sang and Nguyen, Minh-Kha and Le, Minh-Hien and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Multimedia Analysis and Pattern Recognition (MAPR)}, year = {2024} } - CHIiCONTRA: Toward Thematic Collection Design Via Interactive Concept TransferDinh-Khoi Vo*, Duy-Nam Ly*, Khanh-Duy Le, Tam V. Nguyen, Minh-Triet Tran, and 1 more authorIn ACM Conference on Human Factors in Computing Systems (CHI), 2024(A* Rank, Late Breaking Work)
Creating thematic collections in industries demands innovative designs and cohesive concepts. Designers may face challenges in maintaining thematic consistency when drawing inspiration from existing objects, landscapes, or artifacts. While AI-powered graphic design tools offer help, they often fail to generate cohesive sets based on specific thematic concepts. In response, we introduce iCONTRA, an interactive CONcept TRAnsfer system. With a user-friendly interface, iCONTRA enables both experienced designers and novices to effortlessly explore creative design concepts and efficiently generate thematic collections. We also propose a zero-shot image editing algorithm that gradually integrates information from initial objects, ensuring consistency in the generation process without influencing the background and eliminating the need to fine-tune models. A pilot study suggests iCONTRA’s potential to reduce designers’ efforts. Experimental results demonstrate its effectiveness in producing consistent and high-quality object concept transfers. iCONTRA stands as a promising tool for innovation and creative exploration in thematic collection design.
@inproceedings{Vo2024iCONTRA, title = {iCONTRA: Toward Thematic Collection Design Via Interactive Concept Transfer}, author = {Vo, Dinh-Khoi and Ly, Duy-Nam and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM Conference on Human Factors in Computing Systems (CHI)}, year = {2024}, note = {(A* Rank, Late Breaking Work)}, presentation = {https://www.youtube.com/watch?v=ZqKlhfm2cB4}, project_page = {https://github.com/vdkhoi20/iCONTRA} } - CHIARtVista: Gateway To Empower Anyone Into ArtistTrong-Vu Hoang*, Quang-Binh Nguyen*, Duy-Nam Ly, Khanh-Duy Le, Tam V. Nguyen, and 2 more authorsIn ACM Conference on Human Factors in Computing Systems (CHI), 2024(A* Rank, Late Breaking Work)
Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users’ abstract ideas and generates sketches for users to draw but also goes beyond, crafting vibrant paintings in various painting styles. ARtVista also offers users an alternative approach to create striking paintings by simulating the paint-by-number concept on reference images, empowering users to create visually stunning artwork devoid of the necessity for advanced drawing skills. We perform a pilot study and reveal positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills.
@inproceedings{Hoang2024ARtVista, title = {ARtVista: Gateway To Empower Anyone Into Artist}, author = {Hoang, Trong-Vu and Nguyen, Quang-Binh and Ly, Duy-Nam and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {ACM Conference on Human Factors in Computing Systems (CHI)}, year = {2024}, note = {(A* Rank, Late Breaking Work)}, presentation = {https://www.youtube.com/watch?v=vpL-Ttvp6Ds}, project_page = {https://github.com/htrvu/ARtVista} } - ISBIPISeg: Polyp Instance Segmentation with Texture Denoising and Adaptive RegionTan-Cong Nguyen, Kim Anh Phung, Tien-Phat Nguyen, Thao Dao, Cong Nhan Pham, and 5 more authorsIn IEEE International Symposium on Biomedical Imaging (ISBI), 2024(A Rank)
@inproceedings{Nguyen2024PISeg, title = {PISeg: Polyp Instance Segmentation with Texture Denoising and Adaptive Region}, author = {Nguyen, Tan-Cong and Phung, Kim Anh and Nguyen, Tien-Phat and Dao, Thao and Pham, Cong Nhan and Nguyen, Quang-Thuc and Le, Trung-Nghia and Shen, Ju and Nguyen, Tam V. and Tran, Minh-Triet}, booktitle = {IEEE International Symposium on Biomedical Imaging (ISBI)}, year = {2024}, note = {(A Rank)} } - AAAIMaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance SegmentationMinh-Quan Le, Tam V. Nguyen, Trung-Nghia Le, Thanh-Toan Do, Minh N. Do, and 1 more authorIn AAAI Conference on Artificial Intelligence, 2024(A* Rank, Oral)
Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which tries to segment instance objects from a query image with a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g., the mean of the K-shot examples) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask conditioned on an object region and K-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise for populating low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods.
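Classifier-free guidance, which the abstract mentions for conditioning mask generation on category information, can be sketched generically as blending conditional and unconditional noise predictions. The code below is that textbook formulation, not the authors' implementation; all names are illustrative.

```python
# Generic classifier-free guidance step (textbook formulation, not the MaskDiff code).
import torch

def cfg_noise(eps_model, x_t, t, cond, guidance_scale=3.0):
    """eps_model(x_t, t, cond) predicts noise; cond=None denotes the unconditional branch."""
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)
    return (1 + guidance_scale) * eps_cond - guidance_scale * eps_uncond

if __name__ == "__main__":
    dummy_model = lambda x, t, c: torch.zeros_like(x) if c is None else 0.1 * x
    x_t = torch.randn(1, 1, 64, 64)                      # a noisy binary-mask latent
    print(cfg_noise(dummy_model, x_t, t=10, cond="category").shape)
```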
@inproceedings{Le2024MaskDiff, title = {MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation}, author = {Le, Minh-Quan and Nguyen, Tam V. and Le, Trung-Nghia and Do, Thanh-Toan and Do, Minh N. and Tran, Minh-Triet}, booktitle = {AAAI Conference on Artificial Intelligence}, year = {2024}, note = {(A* Rank, Oral)}, project_page = {https://github.com/minhquanlecs/MaskDiff} } - MediaEvalMedico Multimedia Task at MediaEval 2023: Transparent Tracking of SpermatozoaVajira Thambawita, Andrea Storås, Tuan-Luc Huynh, Hai-Dang Nguyen, Minh-Triet Tran, and 5 more authorsIn Multimedia Evaluation Workshop (MediaEval), 2024
@inproceedings{Thambawita2024MediaEval, title = {Medico Multimedia Task at MediaEval 2023: Transparent Tracking of Spermatozoa}, author = {Thambawita, Vajira and Storås, Andrea and Huynh, Tuan-Luc and Nguyen, Hai-Dang and Tran, Minh-Triet and Le, Trung-Nghia and Halvorsen, Pål and Riegler, Michael and Hicks, Steven and Tran, Thien-Phuc}, booktitle = {Multimedia Evaluation Workshop (MediaEval)}, year = {2024}, } - MMMNearbyPatchCL: Leveraging Nearby Patches for Self-Supervised Patch-Level Multi-Class Classification in Whole-Slide ImagesGia-Bao Le*, Van-Tien Nguyen*, Trung-Nghia Le, and Minh-Triet TranIn International Conference on Multimedia Modeling (MMM), 2024(B Rank, Oral)
Whole-slide image (WSI) analysis plays a crucial role in cancer diagnosis and treatment. In addressing the demands of this critical task, self-supervised learning (SSL) methods have emerged as a valuable resource, leveraging their efficiency in circumventing the need for the large number of annotations that supervised methods require, which are costly and time-consuming to obtain. Nevertheless, patch-wise representation may exhibit instability in performance, primarily due to class imbalances stemming from patch selection within WSIs. In this paper, we introduce Nearby Patch Contrastive Learning (NearbyPatchCL), a novel self-supervised learning method that leverages nearby patches as positive samples and a decoupled contrastive loss for robust representation learning. Our method demonstrates a tangible enhancement in performance for downstream tasks involving patch-level multi-class classification. Additionally, we curate a new dataset derived from WSIs sourced from the Canine Cutaneous Cancer Histology dataset, thus establishing a benchmark for the rigorous evaluation of patch-level multi-class classification methodologies. Intensive experiments show that our method significantly outperforms the supervised baseline and state-of-the-art SSL methods with top-1 classification accuracy of 87.56%. Our method also achieves comparable results while utilizing a mere 1% of labeled data, a stark contrast to the 100% labeled data requirement of other approaches.
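As a rough illustration of the positive-sampling idea (nearby patches as positives), the sketch below uses a standard InfoNCE-style objective in which patches sharing a neighbourhood id act as mutual positives; it omits the decoupled contrastive loss used in the paper, and all names are assumptions.

```python
# Rough illustration only: InfoNCE-style loss with spatially nearby patches as positives
# (omits the decoupled contrastive loss used in the paper; names are assumptions).
import torch
import torch.nn.functional as F

def nearby_patch_contrastive_loss(z, group_ids, temperature=0.1):
    """z: (N, D) patch embeddings; group_ids: (N,) ints, nearby patches share an id."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                          # (N, N) cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    pos = (group_ids[:, None] == group_ids[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))        # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos.float()).sum(1) / pos_counts
    return loss[pos.sum(1) > 0].mean()

if __name__ == "__main__":
    z = torch.randn(8, 128)
    group_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])     # pairs of nearby patches
    print(nearby_patch_contrastive_loss(z, group_ids).item())
```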
@inproceedings{Le2024NearbyPatchCL, title = {NearbyPatchCL: Leveraging Nearby Patches for Self-Supervised Patch-Level Multi-Class Classification in Whole-Slide Images}, author = {Le, Gia-Bao and Nguyen, Van-Tien and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {International Conference on Multimedia Modeling (MMM)}, year = {2024}, note = {(B Rank, Oral)}, project_page = {https://github.com/nvtien457/NearbyPatchCL} }
2023
- C&GSketchANIMAR: Sketch-based 3D Animal Fine-Grained RetrievalTrung-Nghia Le, Tam V. Nguyen, Minh-Quan Le, Trong-Thuan Nguyen, Viet-Tham Huynh, and 29 more authorsComputers & Graphics (Special Section on 3DOR 2023), 2023(Q2, IF = 2.62 in 2022)
The retrieval of 3D objects has gained significant importance in recent years due to its broad range of applications in computer vision, computer graphics, virtual reality, and augmented reality. However, the retrieval of 3D objects presents significant challenges due to the intricate nature of 3D models, which can vary in shape, size, and texture, and have numerous polygons and vertices. To this end, we introduce a novel SHREC challenge track that focuses on retrieving relevant 3D animal models from a dataset using sketch queries and expedites accessing 3D models through available sketches. Furthermore, a new dataset named ANIMAR was constructed in this study, comprising a collection of 711 unique 3D animal models and 140 corresponding sketch queries. Our contest requires participants to retrieve 3D models based on complex and detailed sketches. We receive satisfactory results from eight teams and 204 runs. Although further improvement is necessary, the proposed task has the potential to incentivize additional research in the domain of 3D object retrieval, potentially yielding benefits for a wide range of applications. We also provide insights into potential areas of future research, such as improving techniques for feature extraction and matching and creating more diverse datasets to evaluate retrieval performance.
@article{Le2023SketchANIMAR, title = {SketchANIMAR: Sketch-based 3D Animal Fine-Grained Retrieval}, author = {Le, Trung-Nghia and Nguyen, Tam V. and Le, Minh-Quan and Nguyen, Trong-Thuan and Huynh, Viet-Tham and Do, Trong-Le and Le, Khanh-Duy and Tran, Mai-Khiem and Hoang-Xuan, Nhat and Nguyen-Ho, Thang-Long and Nguyen, Vinh-Tiep and Le-Pham, Nhat-Quynh and Pham, Huu-Phuc and Hoang, Trong-Vu and Nguyen, Quang-Binh and Nguyen-Mau, Trong-Hieu and Huynh, Tuan-Luc and Le, Thanh-Danh and Nguyen-Ha, Ngoc-Linh and Truong-Thuy, Tuong-Vy and Phong, Truong Hoai and Diep, Tuong-Nghiem and Ho, Khanh-Duy and Nguyen, Xuan-Hieu and Tran, Thien-Phuc and Yang, Tuan-Anh and Tran, Kim-Phat and Hoang, Nhu-Vinh and Nguyen, Minh-Quang and Vo, Hoai-Danh and Doan, Minh-Hoa and Nguyen, Hai-Dang and Sugimoto, Akihiro and Tran, Minh-Triet}, journal = {Computers & Graphics (Special Section on 3DOR 2023)}, year = {2023}, note = {(Q2, IF = 2.62 in 2022)}, } - C&GTextANIMAR: Text-based 3D Animal Fine-Grained RetrievalTrung-Nghia Le, Tam V. Nguyen, Minh-Quan Le, Trong-Thuan Nguyen, Viet-Tham Huynh, and 28 more authorsComputers & Graphics (Special Section on 3DOR 2023), 2023(Q2, IF = 2.62 in 2022)
3D object retrieval is an important yet challenging task that has drawn more and more attention in recent years. While existing approaches have made strides in addressing this issue, they are often limited to restricted settings such as image and sketch queries, which are often unfriendly interactions for common users. In order to overcome these limitations, this paper presents a novel SHREC challenge track focusing on text-based fine-grained retrieval of 3D animal models. Unlike previous SHREC challenge tracks, the proposed task is considerably more challenging, requiring participants to develop innovative approaches to tackle the problem of text-based retrieval. Despite the increased difficulty, we believe this task can potentially drive useful applications in practice and facilitate more intuitive interactions with 3D objects. Five groups participated in our competition, submitting a total of 114 runs. While the results obtained in our competition are satisfactory, we note that the challenges presented by this task are far from fully solved. As such, we provide insights into potential areas for future research and improvements. We believe we can help push the boundaries of 3D object retrieval and facilitate more user-friendly interactions via vision-language technologies.
@article{Le2023TextANIMAR, title = {TextANIMAR: Text-based 3D Animal Fine-Grained Retrieval}, author = {Le, Trung-Nghia and Nguyen, Tam V. and Le, Minh-Quan and Nguyen, Trong-Thuan and Huynh, Viet-Tham and Do, Trong-Le and Le, Khanh-Duy and Tran, Mai-Khiem and Hoang-Xuan, Nhat and Nguyen-Ho, Thang-Long and Nguyen, Vinh-Tiep and Diep, Tuong-Nghiem and Ho, Khanh-Duy and Nguyen, Xuan-Hieu and Tran, Thien-Phuc and Yang, Tuan-Anh and Tran, Kim-Phat and Hoang, Nhu-Vinh and Nguyen, Minh-Quang and Nguyen, E-Ro and Nguyen-Nhat, Minh-Khoi and To, Tuan-An and Huynh-Le, Trung-Truc and Nguyen, Nham-Tan and Luong, Hoang-Chau and Phong, Truong Hoai and Le-Pham, Nhat-Quynh and Pham, Huu-Phuc and Hoang, Trong-Vu and Nguyen, Quang-Binh and Nguyen, Hai-Dang and Sugimoto, Akihiro and Tran, Minh-Triet}, journal = {Computers & Graphics (Special Section on 3DOR 2023)}, year = {2023}, note = {(Q2, IF = 2.62 in 2022)}, } - IEEE OJSPPurifying Adversarial Images using Adversarial Autoencoders with Conditional Normalizing FlowsYi Ji, Trung-Nghia Le, Huy H. Nguyen, and Isao EchizenIEEE Open Journal of Signal Processing, 2023(ICIP, Journal Track) (Q2, IF = 2.89 in 2022)
We present a target-agnostic adversarial autoencoder with conditional normalizing flows specifically designed to, given any unlabeled image dataset, purify adversarial samples into clean images, i.e., remove adversarial noise from the images while preserving their visual quality. In our model interpretation, samples are processed by manifold projection in which the encoder brings the sample back into a posterior data distribution in latent space so that the sample is less likely to be irregular to the learned representation of any target classifier. Normalizing flows conditioned on top of our hybrid network structure and walk-back training are used to deal with common drawbacks of generative model and autoencoder-based approaches: not only the trade-off between compression loss and over-fitting on training data but also the structural model dependency on dataset classes and labels. Experiments demonstrated that our proposed model is preferable to existing target-agnostic adversarial defense methods particularly for large and unlabeled image datasets.
@article{Le2023AAEFlow, title = {Purifying Adversarial Images using Adversarial Autoencoders with Conditional Normalizing Flows}, author = {Ji, Yi and Le, Trung-Nghia and Nguyen, Huy H. and Echizen, Isao}, journal = {IEEE Open Journal of Signal Processing}, year = {2023}, note = {(ICIP, Journal Track) (Q2, IF = 2.89 in 2022)}, } - AIRImage Synthesis: A Review of Methods, Datasets, Evaluation Metrics, and Future OutlookSamah Saeed Baraheem, Trung-Nghia Le, and Tam V. NguyenArtificial Intelligence Review, 2023(Q1, IF = 12.0 in 2022)
Image synthesis is a process of converting the input text, sketch, or other sources, i.e., another image or mask, into an image. It is an important problem in the computer vision field, where it has attracted the research community to attempt to solve this challenge at a high level to generate photorealistic images. Different techniques and strategies have been employed to achieve this purpose. Thus, the aim of this paper is to provide a comprehensive review of various image synthesis models covering several aspects. First, the image synthesis concept is introduced. We then review different image synthesis methods divided into three categories: image generation from text, sketch, and other inputs, respectively. Each sub-category is introduced under the proper category based upon the general framework to provide a broad vision of all existing image synthesis methods. Next, brief details of the benchmarked datasets used in image synthesis are discussed along with specifying the image synthesis models that leverage them. Regarding the evaluation, we summarize the metrics used to evaluate the image synthesis models. Moreover, a detailed analysis based on the evaluation metrics of the results of the introduced image synthesis is provided. Finally, we discuss some existing challenges and suggest possible future research directions.
@article{Le2023ImageSynthesisReview, title = {Image Synthesis: A Review of Methods, Datasets, Evaluation Metrics, and Future Outlook}, author = {Baraheem, Samah Saeed and Le, Trung-Nghia and Nguyen, Tam V.}, journal = {Artificial Intelligence Review}, year = {2023}, note = {(Q1, IF = 12.0 in 2022)}, } - SoICTMulti-Branch Network for Imagery Emotion PredictionQuoc-Bao Ninh, Hai-Chan Nguyen, Triet Huynh, and Trung-Nghia LeIn International Symposium on Information and Communication Technology (SoICT), 2023
For a long time, images have proved perfect at both storing and conveying rich semantics, especially human emotions. A lot of research has been conducted to provide machines with the ability to recognize emotions in photos of people. Previous methods mostly focus on facial expressions but fail to consider the scene context, even though scene context plays an important role in predicting emotions and leads to more accurate results. In addition, Valence-Arousal-Dominance (VAD) values offer a more precise quantitative understanding of continuous emotions, yet there has been less emphasis on predicting them compared to discrete emotional categories. In this paper, we present a novel Multi-Branch Network (MBN), which utilizes various sources of information, including faces, bodies, and scene contexts, to predict both discrete and continuous emotions in an image. Experimental results on the EMOTIC dataset, which contains large-scale images of people in unconstrained situations labeled with 26 discrete categories of emotions and VAD values, show that our proposed method significantly outperforms state-of-the-art methods with 28.4% in mAP and 0.93 in MAE. The results highlight the importance of utilizing multiple sources of contextual information in emotion prediction and illustrate the potential of our proposed method in a wide range of applications, such as affective computing, human-computer interaction, and social robotics.
@inproceedings{Ninh2023EmotionNet, title = {Multi-Branch Network for Imagery Emotion Prediction}, author = {Ninh, Quoc-Bao and Nguyen, Hai-Chan and Huynh, Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2023}, project_page = {https://github.com/BaoNinh2808/Multi-Branch-Network-for-Imagery-Emotion-Prediction} } - RIVFBudget-Aware Road Semantic Segmentation in Unseen Foggy ScenesTan-Hiep To, Thanh-Nghi Do, Duc-Nghia Ngo, Minh-Triet Tran, and Trung-Nghia LeIn International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2023
In autonomous driving, the reliable and accurate identification of road-related objects plays a crucial role in ensuring safe and efficient navigation services. Unfortunately, traditional semantic segmentation methods often encounter visibility challenges, particularly in adverse weather conditions like fog, leading to an increased frequency of traffic accidents. To address this issue, we propose and evaluate two budget-aware approaches aimed at significantly improving the efficiency and accuracy of road object semantic segmentation under foggy weather conditions. The first approach involves the integration of state-of-the-art image dehazing algorithms, designed to mitigate the adverse effects of fog on input images. This method effectively enhances visibility and clarity, thereby enhancing the segmentation process. The second approach leverages advanced algorithms and models to simulate foggy environments, introducing diversity into the training dataset. By exposing the models to varying degrees of simulated fog, they become more robust and adaptive to real-world foggy conditions, ultimately leading to improved segmentation performance. To assess the effectiveness of these approaches, we employ various well-established models on publicly available datasets to accurately represent challenging foggy scenarios encountered in autonomous driving. Our experimental results demonstrate that most models exhibit noticeable accuracy improvements, with some achieving up to a 20% increase when benefiting from the two proposed solutions.
@inproceedings{To2023BudgetSeg, title = {Budget-Aware Road Semantic Segmentation in Unseen Foggy Scenes}, author = {To, Tan-Hiep and Do, Thanh-Nghi and Ngo, Duc-Nghia and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, year = {2023}, } - RIVFEnsemble Learning for Vietnamese Scene Text Spotting in Urban EnvironmentsHieu Nguyen*, Cong-Hoang Ta*, Phuong-Thuy Le-Nguyen*, Minh-Triet Tran, and Trung-Nghia LeIn International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2023
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed method achieves a significant improvement in accuracy over existing methods, with an impressive gain of 5%. These results unequivocally demonstrate the efficacy of ensemble learning in the context of Vietnamese scene text spotting in urban environments, highlighting its potential for real-world applications, such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
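One simple way to realize the ensembling described above is to pool word-box predictions from several spotters and suppress duplicates with non-maximum suppression; the sketch below assumes that setup and is not the authors' exact pipeline.

```python
# Assumed simple ensembling scheme (not necessarily the authors' pipeline): pool boxes
# from several text spotters and keep the highest-confidence ones via NMS.
import torch
from torchvision.ops import nms

def ensemble_text_boxes(predictions, iou_thr=0.5):
    """predictions: list of (boxes (N, 4), scores (N,), texts list[str]) per model."""
    boxes = torch.cat([p[0] for p in predictions])
    scores = torch.cat([p[1] for p in predictions])
    texts = [t for p in predictions for t in p[2]]
    keep = nms(boxes, scores, iou_thr)                     # indices of surviving boxes
    return boxes[keep], scores[keep], [texts[i] for i in keep.tolist()]
```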
@inproceedings{Nguyen2023SceneText, title = {Ensemble Learning for Vietnamese Scene Text Spotting in Urban Environments}, author = {Nguyen, Hieu and Ta, Cong-Hoang and Le-Nguyen, Phuong-Thuy and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, year = {2023}, } - PSIVTEfficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal EmbeddingTuan-Luc Huynh, Thanh-Danh Le, Tam V. Nguyen, Trung-Nghia Le, and Minh-Triet TranIn Pacific-Rim Symposium on Image and Video Technology (PSIVT), 2023(C Rank - Best Paper Award)
In this paper, we address the crucial task of brain tumor segmentation in medical imaging and propose innovative approaches to enhance its performance. The current state-of-the-art nnU-Net has shown promising results but suffers from extensive training requirements and underutilization of pre-trained weights. To overcome these limitations, we integrate Axial-Coronal-Sagittal convolutions and pre-trained weights from ImageNet into the nnU-Net framework, resulting in reduced training epochs, reduced trainable parameters, and improved efficiency. Two strategies for transferring 2D pre-trained weights to the 3D domain are presented, ensuring the preservation of learned relationships and feature representations critical for effective information propagation. Furthermore, we explore a joint classification and segmentation model that leverages pre-trained encoders from a brain glioma grade classification proxy task, leading to enhanced segmentation performance, especially for challenging tumor labels. Experimental results demonstrate that our proposed methods, in fast training settings, match or even outperform the ensemble of cross-validation models, a common practice in the brain tumor segmentation literature.
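One common way to transfer 2D pre-trained weights to a 3D network is to inflate each 2D kernel along the extra depth axis and rescale it. The sketch below shows that generic strategy; it is not necessarily the exact scheme used with the Axial-Coronal-Sagittal convolutions in the paper.

```python
# Generic 2D-to-3D weight inflation (a common strategy; not necessarily the paper's scheme).
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, depth: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(depth, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(depth // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # Replicate the 2D kernel along depth and divide by depth to keep activations on scale.
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

if __name__ == "__main__":
    conv3d = inflate_conv2d_to_3d(nn.Conv2d(3, 16, kernel_size=3, padding=1))
    print(conv3d(torch.randn(1, 3, 8, 32, 32)).shape)      # torch.Size([1, 16, 8, 32, 32])
```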
@inproceedings{Huynh2023BrainSeg, title = {Efficient 3D Brain Tumor Segmentation with Axial-Coronal-Sagittal Embedding}, author = {Huynh, Tuan-Luc and Le, Thanh-Danh and Nguyen, Tam V. and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {Pacific-Rim Symposium on Image and Video Technology (PSIVT)}, year = {2023}, note = {(C Rank - Best Paper Award)}, } - PSIVTCluster-based Video Summarization with Temporal Context AwarenessHai-Dang Huynh-Lam*, Ngoc-Phuong Ho-Thi*, Minh-Triet Tran, and Trung-Nghia LeIn Pacific-Rim Symposium on Image and Video Technology (PSIVT), 2023(C Rank)
In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart from prior cluster-based summarization methods. The resulting temporal-aware clusters are then utilized to compute the final summary, using simple rules for keyframe selection and frame importance scoring. Experimental results on the SumMe dataset demonstrate the effectiveness of our proposed approach, outperforming existing unsupervised methods and achieving comparable performance to state-of-the-art supervised summarization techniques.
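The keyframe-selection rule mentioned above can be illustrated by picking, within each temporal-aware cluster, the frame closest to the cluster centroid. This is a simplified stand-in for the paper's scoring rules, and the names are assumptions.

```python
# Simplified stand-in for cluster-based keyframe selection (names are assumptions).
import numpy as np

def select_keyframes(features, cluster_labels):
    """features: (T, D) frame descriptors; cluster_labels: (T,) cluster id per frame."""
    keyframes = {}
    for c in np.unique(cluster_labels):
        idx = np.where(cluster_labels == c)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        keyframes[int(c)] = int(idx[dists.argmin()])       # frame nearest the centroid
    return keyframes

if __name__ == "__main__":
    feats = np.random.rand(120, 64)                        # 120 frames, 64-D descriptors
    labels = np.repeat(np.arange(4), 30)                   # 4 temporally consecutive clusters
    print(select_keyframes(feats, labels))
```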
@inproceedings{HuynhLam2023VideoSumm, title = {Cluster-based Video Summarization with Temporal Context Awareness}, author = {Huynh-Lam, Hai-Dang and Ho-Thi, Ngoc-Phuong and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {Pacific-Rim Symposium on Image and Video Technology (PSIVT)}, year = {2023}, note = {(C Rank)}, project_page = {https://github.com/hcmus-thesis-gulu/TAC-SUM} } - ISMARDM-VTON: Distilled Mobile Real-time Virtual Try-OnKhoi-Nguyen Nguyen-Ngoc, Thanh-Tung Phan-Nguyen, Khanh-Duy Le, Tam V. Nguyen, Minh-Triet Tran, and 1 more authorIn International Symposium on Mixed and Augmented Reality (ISMAR), 2023(A* Rank, Nominated for Best Poster)
The fashion e-commerce industry has witnessed significant growth in recent years, prompting the exploration of image-based virtual try-on techniques to incorporate Augmented Reality (AR) experiences into online shopping platforms. However, existing research has primarily overlooked a crucial aspect: the runtime of the underlying machine-learning model. While existing methods prioritize enhancing output quality, they often disregard the execution time, which restricts their application to a limited range of devices. To address this gap, we propose Distilled Mobile Real-time Virtual Try-On (DM-VTON), a novel virtual try-on framework designed to achieve simplicity and efficiency. Our approach is based on a knowledge distillation scheme that leverages a strong Teacher network as supervision to guide a Student network without relying on human parsing. Notably, we introduce an efficient Mobile Generative Module within the Student network, significantly reducing the runtime while ensuring high-quality output. Additionally, we propose Virtual Try-on-guided Pose for Data Synthesis to address the limited pose variation observed in training images. Experimental results show that the proposed method can achieve 40 frames per second on a single Nvidia Tesla T4 GPU and only take up 37 MB of memory while producing almost the same output quality as other state-of-the-art methods. DM-VTON stands poised to facilitate the advancement of real-time AR applications, in addition to the generation of lifelike attired human figures tailored for diverse specialized training tasks.
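The teacher-student scheme described above can be illustrated with a basic response-based distillation loss in which the student try-on output imitates the teacher's output (and, optionally, the ground truth). This is a hedged sketch, not the DM-VTON training objective; names and shapes are assumptions.

```python
# Hedged sketch of response-based distillation (not the DM-VTON objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target=None, alpha=0.5):
    """student_out / teacher_out / target: (B, 3, H, W) try-on images."""
    loss = F.l1_loss(student_out, teacher_out.detach())    # imitate the frozen teacher
    if target is not None:
        loss = alpha * loss + (1 - alpha) * F.l1_loss(student_out, target)
    return loss

if __name__ == "__main__":
    s, t, gt = torch.rand(2, 3, 64, 48), torch.rand(2, 3, 64, 48), torch.rand(2, 3, 64, 48)
    print(distillation_loss(s, t, gt).item())
```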
@inproceedings{NguyenNgoc2023DMVTON, title = {DM-VTON: Distilled Mobile Real-time Virtual Try-On}, author = {Nguyen-Ngoc, Khoi-Nguyen and Phan-Nguyen, Thanh-Tung and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Mixed and Augmented Reality (ISMAR)}, year = {2023}, note = {(A* Rank, Nominated for Best Poster)}, presentation = {https://drive.google.com/file/d/1t7oTAvhwegemfhjUyTJ4-WL2lqT91RLn/view?usp=sharing}, demo = {https://drive.google.com/file/d/1N4r5fp9cigpCc8z40nLcy0SSXbf15yqC/view?usp=sharing}, project_page = {https://github.com/KiseKloset/DM-VTON} } - ISMARVIDES: Virtual Interior Design via Natural Language and Visual GuidanceMinh-Hien Le, Chi-Bien Chu, Khanh-Duy Le, Tam V. Nguyen, Minh-Triet Tran, and 1 more authorIn International Symposium on Mixed and Augmented Reality (ISMAR), 2023(A* Rank, Poster)
Interior design is crucial in creating aesthetically pleasing and functional indoor spaces. However, developing and editing interior design concepts requires significant time and expertise. We propose Virtual Interior DESign (VIDES) system in response to this challenge. Leveraging cutting-edge technology in generative AI, our system can assist users in generating and editing indoor scene concepts quickly, given user text description and visual guidance. Using both visual guidance and language as the conditional inputs significantly enhances the accuracy and coherence of the generated scenes, resulting in visually appealing designs. Through extensive experimentation, we demonstrate the effectiveness of VIDES in developing new indoor concepts, changing indoor styles, and replacing and removing interior objects. The system successfully captures the essence of users’ descriptions while providing flexibility for customization. Consequently, this system can potentially reduce the entry barrier for indoor design, making it more accessible to users with limited technical skills and reducing the time required to create high-quality images. Individuals who have a background in design can now easily communicate their ideas visually and effectively present their design concepts.
@inproceedings{Le2023VIDES, title = {VIDES: Virtual Interior Design via Natural Language and Visual Guidance}, author = {Le, Minh-Hien and Chu, Chi-Bien and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Symposium on Mixed and Augmented Reality (ISMAR)}, year = {2023}, note = {(A* Rank, Poster)}, presentation = {https://drive.google.com/file/d/1L9Oc7r8IWWz2lIaGZMVzmhM9ampRu92N/view?usp=sharing}, demo = {https://drive.google.com/file/d/11Cc2yw89_3TMlnXvlpijgUXRkMZ-89wE/view?usp=sharing} } - WACVAnalysis of Master Vein Attacks on Finger Vein Recognition SystemsHuy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, and Isao EchizenIn Winter Conference on Applications of Computer Vision (WACV), 2023(A Rank)
Finger vein recognition (FVR) systems have been commercially used, especially in ATMs, for customer verification. Thus, it is essential to measure their robustness against various attack methods, especially when a hand-crafted FVR system is used without any countermeasure methods. In this paper, we are the first in the literature to introduce master vein attacks in which we craft a vein-looking image so that it can falsely match with as many identities as possible by the FVR systems. We present two methods for generating master veins for use in attacking these systems. The first uses an adaptation of the latent variable evolution algorithm with a proposed generative model (a multi-stage combination of beta-VAE and WGAN-GP models). The second uses an adversarial machine learning attack method to attack a strong surrogate CNN-based recognition system. The two methods can be easily combined to boost their attack ability. Experimental results demonstrated that the proposed methods alone and together achieved false acceptance rates up to 73.29% and 88.79%, respectively, against Miura’s hand-crafted FVR system. We also point out that Miura’s system is easily compromised by non-vein-looking samples generated by a WGAN-GP model with false acceptance rates up to 94.21%. The results raise the alarm about the robustness of such systems and suggest that master vein attacks should be considered an important security measure.
@inproceedings{Nguyen2023VeinAttack, title = {Analysis of Master Vein Attacks on Finger Vein Recognition Systems}, author = {Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2023}, note = {(A Rank)}, } - WACVCloser Look at the Transferability of Adversarial Examples: How They Fool Different Models DifferentlyFuta Waseda, Sosuke Nishikawa, Trung-Nghia Le, Huy H. Nguyen, and Isao EchizenIn Winter Conference on Applications of Computer Vision (WACV), 2023(A Rank)
Deep neural networks are vulnerable to adversarial examples (AEs), which have adversarial transferability: AEs generated for the source model can mislead another (target) model’s predictions. However, transferability has not been understood in terms of which class the target model’s predictions are misled to (i.e., class-aware transferability). In this paper, we differentiate the cases in which a target model predicts the same wrong class as the source model ("same mistake") or a different wrong class ("different mistake") to analyze and provide an explanation of the mechanism. We find that (1) AEs tend to cause same mistakes, which correlates with "non-targeted transferability"; however, (2) different mistakes occur even between similar models, regardless of the perturbation size. Furthermore, we present evidence that the difference between same mistakes and different mistakes can be explained by non-robust features, predictive but human-uninterpretable patterns: different mistakes occur when non-robust features in AEs are used differently by models. Non-robust features can thus provide consistent explanations for the class-aware transferability of AEs.
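The same-mistake versus different-mistake analysis can be expressed as simple bookkeeping over source and target predictions on adversarial examples. The sketch below shows that bookkeeping under assumed tensor shapes; it is not the authors' evaluation code.

```python
# Bookkeeping sketch for class-aware transferability (assumed shapes, not the authors' code).
import torch

def mistake_breakdown(src_pred, tgt_pred, labels):
    """src_pred, tgt_pred, labels: (N,) integer class tensors for the same adversarial inputs."""
    both_wrong = (src_pred != labels) & (tgt_pred != labels)
    same = (both_wrong & (src_pred == tgt_pred)).float().mean()
    diff = (both_wrong & (src_pred != tgt_pred)).float().mean()
    return {"same_mistake_rate": same.item(), "different_mistake_rate": diff.item()}

if __name__ == "__main__":
    labels = torch.tensor([0, 1, 2, 3])
    src = torch.tensor([1, 1, 0, 3])                       # source model fooled on items 0 and 2
    tgt = torch.tensor([1, 2, 0, 0])
    print(mistake_breakdown(src, tgt, labels))              # same: 0.5, different: 0.0
```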
@inproceedings{Waseda2023AdversarialTransfer, title = {Closer Look at the Transferability of Adversarial Examples: How They Fool Different Models Differently}, author = {Waseda, Futa and Nishikawa, Sosuke and Le, Trung-Nghia and Nguyen, Huy H. and Echizen, Isao}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2023}, note = {(A Rank)}, }
2022
- ITECurrent Status of Deepfake Generation and Detection (Deepfakeの生成と検出の現状)Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, and Isao EchizenThe Journal of The Institute of Image Information and Television Engineers (ITE), Jul 2022(In Japanese, ISSN 1342-6907) — Special Feature: AI and Cyber Security in the Infodemic Era
@article{Le2022DeepfakeStatus, title = {Current Status of Deepfake Generation and Detection (Deepfakeの生成と検出の現状)}, author = {Le, Trung-Nghia and Nguyen, Huy H. and Yamagishi, Junichi and Echizen, Isao}, journal = {The Journal of The Institute of Image Information and Television Engineers (ITE)}, year = {2022}, volume = {76}, number = {4}, month = jul, note = {(In Japanese, ISSN 1342-6907) — Special Feature: AI and Cyber Security in the Infodemic Era}, } - MVAContextual Guided Segmentation Framework for Semi-supervised Video Instance SegmentationTrung-Nghia Le, Tam V. Nguyen, and Minh-Triet TranMachine Vision and Applications (MVA), Jul 2022(Q2, IF = 3.3 in 2022)
In this paper, we propose the Contextual Guided Segmentation (CGS) framework for video instance segmentation in three passes. In the first pass, i.e., preview segmentation, we propose Instance Re-Identification Flow to estimate main properties of each instance (i.e., human/non-human, rigid/deformable, known/unknown category) by propagating its preview mask to other frames. In the second pass, i.e., contextual segmentation, we introduce multiple contextual segmentation schemes. For human instances, we develop skeleton-guided segmentation in a frame along with object flow to correct and refine the result across frames. For non-human instances, if the instance has a wide variation in appearance and belongs to known categories (which can be inferred from the initial mask), we adopt instance segmentation. If the non-human instance is nearly rigid, we train FCNs on synthesized images from the first frame of a video sequence. In the final pass, i.e., guided segmentation, we develop a novel fine-grained segmentation method on non-rectangular regions of interest (ROIs). The natural-shaped ROI is generated by applying guided attention from the neighbor frames of the current one to reduce the ambiguity in the segmentation of different overlapping instances. Forward mask propagation is followed by backward mask propagation to further restore missing instance fragments due to re-appeared instances, fast motion, occlusion, or heavy deformation. Finally, instances in each frame are merged based on their depth values, together with human and non-human object interaction and rare instance priority. Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of our proposed framework. We consistently achieved 3rd place in the DAVIS Challenges 2017-2019 with 75.4%, 72.4%, and 78.4% in terms of global score, region similarity, and contour accuracy, respectively.
@article{Le2022CGSF, title = {Contextual Guided Segmentation Framework for Semi-supervised Video Instance Segmentation}, author = {Le, Trung-Nghia and Nguyen, Tam V. and Tran, Minh-Triet}, journal = {Machine Vision and Applications (MVA)}, year = {2022}, note = {(Q2, IF = 3.3 in 2022)}, project_page = {https://sites.google.com/view/ltnghia/research/vos} } - Springer Book ChapterRobust Deepfake On Unrestricted Media: Generation And DetectionTrung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, and Isao EchizenIn Frontiers in Fake Media Generation and Detection, Jul 2022
Recent advances in deep learning have led to substantial improvements in deepfake generation, resulting in fake media with a more realistic appearance. Although deepfake media have potential applications in a wide range of areas and are drawing much attention from both the academic and industrial communities, they also raise serious social and criminal concerns. This chapter explores the evolution of and challenges in deepfake generation and detection. It also discusses possible ways to improve the robustness of deepfake detection for a wide variety of media (e.g., in-the-wild images and videos). Finally, it suggests a focus for future fake media research.
@incollection{Le2022FrontiersDeepfake, title = {Robust Deepfake On Unrestricted Media: Generation And Detection}, author = {Le, Trung-Nghia and Nguyen, Huy H. and Yamagishi, Junichi and Echizen, Isao}, booktitle = {Frontiers in Fake Media Generation and Detection}, year = {2022}, book = {https://link.springer.com/book/10.1007/978-981-19-1524-6} } - IEEE T-IPCamouflaged Instance Segmentation In-The-Wild: Dataset, Method, and Benchmark SuiteTrung-Nghia Le, Yubo Cao, Tan-Cong Nguyen, Minh-Quan Le, Khanh-Duy Nguyen, and 3 more authorsIEEE Transactions on Image Processing (T-IP), Jul 2022(Q1, IF = 10.6 in 2022)
This paper pushes the envelope on decomposing camouflaged regions in an image into meaningful components, namely, camouflaged instances. To promote the new task of camouflaged instance segmentation of in-the-wild images, we introduce a dataset, dubbed CAMO++, that extends our preliminary CAMO dataset (camouflaged object segmentation) in terms of quantity and diversity. The new dataset substantially increases the number of images with hierarchical pixel-wise ground truths. We also provide a benchmark suite for the task of camouflaged instance segmentation. In particular, we present an extensive evaluation of state-of-the-art instance segmentation methods on our newly constructed CAMO++ dataset in various scenarios. We also present a camouflage fusion learning (CFL) framework for camouflaged instance segmentation to further improve the performance of state-of-the-art methods.
@article{Le2022CamoPlusPlus, title = {Camouflaged Instance Segmentation In-The-Wild: Dataset, Method, and Benchmark Suite}, author = {Le, Trung-Nghia and Cao, Yubo and Nguyen, Tan-Cong and Le, Minh-Quan and Nguyen, Khanh-Duy and Do, Thanh-Toan and Tran, Minh-Triet and Nguyen, Tam V.}, journal = {IEEE Transactions on Image Processing (T-IP)}, year = {2022}, note = {(Q1, IF = 10.6 in 2022)}, project_page = {https://sites.google.com/view/ltnghia/research/camo_plus_plus} } - MediaEvalTail-Aware Sperm Analysis for Transparent Tracking of SpermatozoaTuan-Luc Huynh, Huu-Hung Nguyen, Xuan-Nhat Hoang, Thao Thi Phuong Dao, Tien-Phat Nguyen, and 4 more authorsIn Multimedia Evaluation Workshop (MediaEval), Jul 2022
@inproceedings{Huynh2022TailAwareSperm, title = {Tail-Aware Sperm Analysis for Transparent Tracking of Spermatozoa}, author = {Huynh, Tuan-Luc and Nguyen, Huu-Hung and Hoang, Xuan-Nhat and Dao, Thao Thi Phuong and Nguyen, Tien-Phat and Huynh, Viet-Tham and Nguyen, Hai-Dang and Le, Trung-Nghia and Tran, Minh-Triet}, booktitle = {Multimedia Evaluation Workshop (MediaEval)}, year = {2022}, } - RIVFMultilingual Communication System with Deaf Individuals Utilizing Natural and Visual LanguagesTuan-Luc Huynh*, Khoi-Nguyen Nguyen-Ngoc*, Chi-Bien Chu*, Minh-Triet Tran, and Trung-Nghia LeIn International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), Jul 2022
According to the World Federation of the Deaf, more than two hundred sign languages exist. Therefore, it is challenging to understand deaf individuals, even for proficient sign language users, resulting in a barrier between the deaf community and the rest of society. To bridge this language barrier, we propose a novel multilingual communication system, namely MUGCAT, to improve the communication efficiency of sign language users. By converting recognized hand gestures into expressive pictures, which are universally understood and language independent, our MUGCAT system significantly helps deaf people convey their thoughts. To overcome the limitation that sign language is mostly impossible for ordinary people to translate into complete sentences, we propose reconstructing meaningful sentences from the incomplete translation of sign language. We also measure the semantic similarity of generated sentences with the fragmented recognized hand gestures to preserve the original meaning. Experimental results show that the proposed system works in real time and synthesizes stunning illustrations and meaningful sentences from a few sign language hand gestures. This proves that MUGCAT has promising potential in assisting deaf communication.
@inproceedings{Huynh2022MultilingualDeaf, title = {Multilingual Communication System with Deaf Individuals Utilizing Natural and Visual Languages}, author = {Huynh, Tuan-Luc and Nguyen-Ngoc, Khoi-Nguyen and Chu, Chi-Bien and Tran, Minh-Triet and Le, Trung-Nghia}, booktitle = {International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, year = {2022}, } - WIFSRethinking Adversarial Examples for Location Privacy ProtectionTrung-Nghia Le*, Ta Gu*, Huy H. Nguyen, and Isao EchizenIn IEEE International Workshop on Information Forensics and Security (WIFS), Jul 2022
We have investigated a new application of adversarial examples, namely location privacy protection against landmark recognition systems. We introduce mask-guided multimodal projected gradient descent (MM-PGD), in which adversarial examples are trained on different deep models. Image contents are protected by analyzing the properties of regions to identify the ones most suitable for blending in adversarial examples. We investigated two region identification strategies: class activation map-based MM-PGD, in which the internal behaviors of trained deep models are targeted; and human-vision-based MM-PGD, in which regions that attract less human attention are targeted. Experiments on the Places365 dataset demonstrated that these strategies are potentially effective in defending against black-box landmark recognition systems without the need for much image manipulation.
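To make the region-restricted attack idea above concrete, the following is a minimal sketch of a generic masked PGD loop in PyTorch: a perturbation is optimized only inside a given binary region mask and kept within an L-infinity budget. It is only an illustration of the general technique, not the paper's MM-PGD (which blends adversarial examples trained on multiple deep models and selects regions via class activation maps or human-vision cues); the model, tensor shapes, and hyperparameters are assumptions.

```python
import torch

def masked_pgd(model, image, label, region_mask, eps=8/255, alpha=2/255, steps=40):
    """image: (1, 3, H, W) in [0, 1]; region_mask: (1, 1, H, W) binary; label: (1,) class index."""
    delta = torch.zeros_like(image, requires_grad=True)   # perturbation, optimized in place
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        adv = torch.clamp(image + delta * region_mask, 0.0, 1.0)  # perturb only inside the mask
        loss = loss_fn(model(adv), label)                  # push the recognizer away from the true label
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()             # gradient-ascent step on the loss
            delta.clamp_(-eps, eps)                        # stay within the L_inf budget
            delta.grad.zero_()
    return torch.clamp(image + delta.detach() * region_mask, 0.0, 1.0)
```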
@inproceedings{Le2022AdversarialPrivacy, title = {Rethinking Adversarial Examples for Location Privacy Protection}, author = {Le, Trung-Nghia and Gu, Ta and Nguyen, Huy H. and Echizen, Isao}, booktitle = {IEEE International Workshop on Information Forensics and Security (WIFS)}, year = {2022}, } - ISMARPublic Speaking Simulator with Speech and Audience FeedbackBao Truong, Trung-Nghia Le, Khanh-Duy Le, Minh-Triet Tran, and Tam V. NguyenIn IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Jul 2022(A* Rank, Poster)
Public speaking is one of the most important ways to share ideas with many people in domains such as education, training, marketing, and healthcare. Mastering this skill allows the speaker to clearly advocate for their subject and greatly influence others. However, most of the population reports public speaking anxiety, or glossophobia, which prevents them from effectively conveying their messages to others. One of the best solutions is a safe and private space to practice speaking in front of others. This work therefore aims to provide people with virtual environments in which to practice in front of simulated audiences. In addition, the proposed system provides live audience feedback and speech analysis that can be useful to users. The experiments via a user study provide insights into the proposed public speaking simulator.
@inproceedings{Truong2022PublicSpeakingSim, title = {Public Speaking Simulator with Speech and Audience Feedback}, author = {Truong, Bao and Le, Trung-Nghia and Le, Khanh-Duy and Tran, Minh-Triet and Nguyen, Tam V.}, booktitle = {IEEE International Symposium on Mixed and Augmented Reality (ISMAR)}, year = {2022}, note = {(A* Rank, Poster)}, } - CVPRWGUNNEL: Guided Mixup Augmentation and Multi-View Fusion for Aquatic Animal SegmentationMinh-Quan Le*, Trung-Nghia Le*, Tam V. Nguyen, Isao Echizen, and Minh-Triet TranIn CV4Animal Workshop, CVPR, Jul 2022(Invited Poster)
Recent years have witnessed great advances in object segmentation research. In addition to generic objects, aquatic animals have attracted research attention. Deep learning-based methods are widely used for aquatic animal segmentation and have achieved promising performance. However, there is a lack of challenging datasets for benchmarking. In this work, we build a new dataset dubbed "Aquatic Animal Species." We also devise a novel GUided mixup augmeNtatioN and multi-modEl fusion for aquatic animaL segmentation (GUNNEL) that leverages the advantages of multiple segmentation models to segment aquatic animals effectively and improves the training performance by synthesizing hard samples. Extensive experiments demonstrated the superiority of our proposed framework over existing state-of-the-art instance segmentation methods.
@inproceedings{Le2022GUNNEL, title = {GUNNEL: Guided Mixup Augmentation and Multi-View Fusion for Aquatic Animal Segmentation}, author = {Le, Minh-Quan and Le, Trung-Nghia and Nguyen, Tam V. and Echizen, Isao and Tran, Minh-Triet}, booktitle = {CV4Animal Workshop, CVPR}, year = {2022}, note = {(Invited Poster)}, invited_poster = {https://drive.google.com/file/d/1osYOWDnLn16PVP-fX90s_Kbs-qAIR7KM/view}, }
2021
- JoIMasked Face Analysis via Multi-task Deep LearningVatsa S. Patel, Zhongliang Nie, Trung-Nghia Le, and Tam V. NguyenJournal of Imaging, Jul 2021(Q2, IF = 3.2 in 2022)
Face recognition with wearable items is a challenging task in computer vision and involves the problem of identifying humans wearing a face mask. Masked face analysis via multi-task learning can effectively improve performance in many fields of face analysis. In this paper, we propose a unified framework for predicting the age, gender, and emotion of people wearing face masks. We first construct FGNET-MASK, a masked face dataset for the problem. Then, we propose a multi-task deep learning model to tackle the problem. In particular, the multi-task deep learning model takes the data as input and shares its weights to yield predictions of age, expression, and gender for the masked face. Through extensive experiments, the proposed framework has been found to provide better performance than existing methods.
@article{Patel2021MaskedFace, title = {Masked Face Analysis via Multi-task Deep Learning}, author = {Patel, Vatsa S. and Nie, Zhongliang and Le, Trung-Nghia and Nguyen, Tam V.}, journal = {Journal of Imaging}, year = {2021}, note = {(Q2, IF = 3.2 in 2022)}, } - IEEE AccessMirrorNet: Bio-Inspired Camouflaged Object SegmentationJinnan Yan, Trung-Nghia Le, Khanh-Duy Nguyen, Minh-Triet Tran, Thanh-Toan Do, and 1 more authorIEEE Access, Jul 2021(Q1, IF = 3.9 in 2022)
Camouflaged objects are generally difficult to detect in their natural environment, even for human beings. In this paper, we propose a novel bio-inspired network, named MirrorNet, that leverages both instance segmentation and a mirror stream for camouflaged object segmentation. Different from existing segmentation networks, our proposed network possesses two segmentation streams: the main stream and the mirror stream, corresponding to the original image and its flipped image, respectively. The output from the mirror stream is then fused into the main stream’s result to form the final camouflage map and boost the segmentation accuracy. Extensive experiments conducted on the public CAMO dataset demonstrate the effectiveness of our proposed network. Our proposed method achieves 89% in accuracy, outperforming state-of-the-art methods.
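The flip-and-fuse idea behind the mirror stream can be illustrated in a few lines of PyTorch. The sketch below assumes a generic single-output segmentation model rather than the actual MirrorNet architecture: it runs the model on the original image and its horizontal flip, un-flips the second prediction, and averages the two maps; the averaging fusion is an illustrative assumption, not the network's fusion module.

```python
import torch

def mirror_fused_prediction(seg_model, image):
    """image: (1, 3, H, W) float tensor; seg_model returns a (1, 1, H, W) camouflage map."""
    main_map = seg_model(image)                              # main stream: original image
    flipped = torch.flip(image, dims=[-1])                   # mirror stream: horizontally flipped input
    mirror_map = torch.flip(seg_model(flipped), dims=[-1])   # flip the prediction back to the original frame
    return 0.5 * (main_map + mirror_map)                     # simple average fusion (illustrative choice)
```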
@article{Yan2021MirrorNet, title = {MirrorNet: Bio-Inspired Camouflaged Object Segmentation}, author = {Yan, Jinnan and Le, Trung-Nghia and Nguyen, Khanh-Duy and Tran, Minh-Triet and Do, Thanh-Toan and Nguyen, Tam V.}, journal = {IEEE Access}, year = {2021}, note = {(Q1, IF = 3.9 in 2022)}, project_page = {https://sites.google.com/view/ltnghia/research/camo} } - FG4COVID19Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing RatioKhanh-Duy Nguyen, Huy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, and Isao EchizenIn FG4COVID19, Jul 2021
Estimating the mask-wearing ratio in public places is important as it enables health authorities to promptly analyze and implement policies. Methods for estimating the mask-wearing ratio on the basis of image analysis have been reported. However, there is still a lack of comprehensive research on both methodologies and datasets. Most recent reports straightforwardly propose estimating the ratio by applying conventional object detection and classification methods. It is feasible to use regression-based approaches to estimate the number of people wearing masks, especially for congested scenes with tiny and occluded faces, but this has not been well studied. A large-scale and well-annotated dataset is still in demand. In this paper, we present two methods for ratio estimation that leverage either a detection-based or regression-based approach. For the detection-based approach, we improved the state-of-the-art face detector, RetinaFace, used to estimate the ratio. For the regression-based approach, we fine-tuned the baseline network, CSRNet, used to estimate the density maps for masked and unmasked faces. We also present the first large-scale dataset, the "NFM dataset", which contains 581,108 face annotations extracted from 18,088 video frames in 17 street-view videos. Experiments demonstrated that the RetinaFace-based method has higher accuracy under various situations and that the CSRNet-based method has a shorter operation time thanks to its compactness.
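For readers unfamiliar with the two estimation routes mentioned above, the sketch below shows how a mask-wearing ratio could be derived either from detector outputs or from density maps. The input formats, labels, and threshold are illustrative assumptions and do not reproduce the paper's RetinaFace- or CSRNet-based pipelines.

```python
from typing import List, Tuple
import numpy as np

def ratio_from_detections(detections: List[Tuple[float, str]], score_thresh: float = 0.5) -> float:
    """detections: (confidence, label) pairs with label 'masked' or 'unmasked'."""
    kept = [label for score, label in detections if score >= score_thresh]
    if not kept:
        return 0.0
    return sum(label == "masked" for label in kept) / len(kept)

def ratio_from_density_maps(masked_density: np.ndarray, unmasked_density: np.ndarray) -> float:
    """Each density map sums to an estimated count of masked or unmasked faces in the frame."""
    n_masked, n_unmasked = float(masked_density.sum()), float(unmasked_density.sum())
    total = n_masked + n_unmasked
    return n_masked / total if total > 0 else 0.0
```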
@inproceedings{Nguyen2021MaskWearing, title = {Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio}, author = {Nguyen, Khanh-Duy and Nguyen, Huy H. and Le, Trung-Nghia and Yamagishi, Junichi and Echizen, Isao}, booktitle = {FG4COVID19}, year = {2021}, } - ICCVOpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-WildTrung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, and Isao EchizenIn International Conference on Computer Vision (ICCV), Jul 2021(A* Rank, Acceptance rate 25.9%)
The proliferation of deepfake media is raising concerns among the public and relevant authorities. It has become essential to develop countermeasures against forged faces in social media. This paper presents a comprehensive study on two new countermeasure tasks: multi-face forgery detection and segmentation in-the-wild. Localizing forged faces among multiple human faces in unrestricted natural scenes is far more challenging than the traditional deepfake recognition task. To promote these new tasks, we have created the first large-scale, highly challenging dataset designed with face-wise rich annotations explicitly for face forgery detection and segmentation, namely OpenForensics. With its rich annotations, our OpenForensics dataset has great potential for research in both deepfake prevention and general human face detection. We have also developed a suite of benchmarks for these tasks by conducting an extensive evaluation of state-of-the-art instance detection and segmentation methods on our newly constructed dataset in various scenarios.
@inproceedings{Le2021OpenForensics, title = {OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild}, author = {Le, Trung-Nghia and Nguyen, Huy H. and Yamagishi, Junichi and Echizen, Isao}, booktitle = {International Conference on Computer Vision (ICCV)}, year = {2021}, note = {(A* Rank, Acceptance rate 25.9%)}, presentation = {https://www.youtube.com/watch?v=bO__OcpESuI}, project_page = {https://sites.google.com/view/ltnghia/research/openforensics} } - CVPRWFashion-Guided Adversarial Attack on Person SegmentationMarc Treu*, Trung-Nghia Le*, Huy H. Nguyen*, Junichi Yamagishi, and Isao EchizenIn CVPR Workshop on Media Forensics, Jul 2021(*Equal Contributions)
This paper presents the first adversarial-example-based method for attacking human instance segmentation networks, namely person segmentation networks for short, which are harder to fool than classification networks. We propose a novel Fashion-Guided Adversarial Attack (FashionAdv) framework to automatically identify attackable regions in the target image so as to minimize the effect on image quality. It generates adversarial textures learned from fashion style images and then overlays them on the clothing regions in the original image to make all persons in the image invisible to person segmentation networks. The synthesized adversarial textures are inconspicuous and appear natural to the human eye. The effectiveness of the proposed method is enhanced by robustness training and by jointly attacking multiple components of the target network. Extensive experiments demonstrated the effectiveness of FashionAdv in terms of robustness to image manipulations and storage in cyberspace, as well as appearing natural to the human eye.
@inproceedings{Treu2021FashionAttack, title = {Fashion-Guided Adversarial Attack on Person Segmentation}, author = {Treu, Marc and Le, Trung-Nghia and Nguyen, Huy H. and Yamagishi, Junichi and Echizen, Isao}, booktitle = {CVPR Workshop on Media Forensics}, year = {2021}, note = {(*Equal Contributions)}, presentation = {https://www.youtube.com/watch?v=dgQEk1j05kY}, project_page = {https://sites.google.com/view/ltnghia/research/FashionAdv} } - AAAIInteractive Video Object Mask AnnotationTrung-Nghia Le, Tam V. Nguyen, Quoc-Cuong Tran, Lam Nguyen, Trung-Hieu Hoang, and 2 more authorsIn AAAI Conference on Artificial Intelligence, Jul 2021(A* Rank, Demo)
In this paper, we introduce a practical system for interactive video object mask annotation, which can support multiple back-end methods. To demonstrate the generalization of our system, we introduce a novel approach for video object annotation. Our proposed system takes scribbles at a chosen key-frame from the end-users via a user-friendly interface and produces masks of corresponding objects at the key-frame via the Control-Point-based Scribbles-to-Mask (CPSM) module. The object masks at the key-frame are then propagated to other frames and refined through the Multi-Referenced Guided Segmentation (MRGS) module. Last but not least, the user can correct wrong segmentation at some frames, and the corrected mask is continuously propagated to other frames in the video via the MRGS to produce the object masks at all video frames.
@inproceedings{Le2021InteractiveMask, title = {Interactive Video Object Mask Annotation}, author = {Le, Trung-Nghia and Nguyen, Tam V. and Tran, Quoc-Cuong and Nguyen, Lam and Hoang, Trung-Hieu and Le, Minh-Quan and Tran, Minh-Triet}, booktitle = {AAAI Conference on Artificial Intelligence}, year = {2021}, note = {(A* Rank, Demo)}, presentation = {https://www.youtube.com/watch?v=aVWYYFztMBE}, project_page = {https://sites.google.com/view/ltnghia/research/interactive-video-object-mask-annotation} } - AAAICamouFinder: Finding Camouflaged Instances in ImagesTrung-Nghia Le, Vuong Nguyen, Cong Le, Tan-Cong Nguyen, Minh-Triet Tran, and 1 more authorIn AAAI Conference on Artificial Intelligence, Jul 2021(A* Rank, Demo)
In this paper, we investigate the interesting yet challenging problem of camouflaged instance segmentation. To this end, we first annotate the available CAMO dataset at the instance level. We also employ data augmentation to increase the number of training samples. Then, we train different state-of-the-art instance segmentation methods on the CAMO-instance data. Last but not least, we develop an interactive user interface which demonstrates the performance of different state-of-the-art instance segmentation methods on the task of camouflaged instance segmentation. Users are able to compare the results of different methods on the given input images. Our work is expected to push the envelope of the camouflage analysis problem.
@inproceedings{Le2021CamouFinder, title = {CamouFinder: Finding Camouflaged Instances in Images}, author = {Le, Trung-Nghia and Nguyen, Vuong and Le, Cong and Nguyen, Tan-Cong and Tran, Minh-Triet and Nguyen, Tam V.}, booktitle = {AAAI Conference on Artificial Intelligence}, year = {2021}, note = {(A* Rank, Demo)}, presentation = {https://www.youtube.com/watch?v=RI4nt5MDmwE}, project_page = {https://sites.google.com/view/ltnghia/research/camo_plus_plus} }
2020
- MMText-to-Image Synthesis via Aesthetic LayoutSamah Saeed Baraheem, Trung-Nghia Le, and Tam V. NguyenIn International Conference on Multimedia, Jul 2020(A* Rank, Demo)
In this work, we introduce a practical system which synthesizes an appealing image from natural language descriptions such that the generated image maintains the aesthetic level of photographs. Our proposed method takes text from end-users via a user-friendly interface and produces a set of different label maps via the primary generator (PG). A subset of the label maps is then chosen through the primary aesthetic appreciation (PAA) module. Next, this subset of label maps is fed into the accessory generator (AG), a state-of-the-art image-to-image translation model. Last but not least, the subset of generated images is ranked via the accessory aesthetic appreciation (AAA) module, and the most appealing image is produced.
@inproceedings{Baraheem2020Text2Image, title = {Text-to-Image Synthesis via Aesthetic Layout}, author = {Baraheem, Samah Saeed and Le, Trung-Nghia and Nguyen, Tam V.}, booktitle = {International Conference on Multimedia}, year = {2020}, note = {(A* Rank, Demo)}, presentation = {https://youtu.be/cD12OWi7PgE}, project_page = {https://sites.google.com/view/ltnghia/research/text-to-image-via-aesthetic-layout} } - CVPRWMulti-Referenced Guided Instance Segmentation Framework for Semi-supervised Video Instance SegmentationMinh-Triet Tran, Trung-Hieu Hoang, Tam V. Nguyen, Trung-Nghia Le, E-Ro Nguyen, and 4 more authorsIn CVPR Workshop on DAVIS Challenge on Video Object Segmentation, Jul 2020(4th place)
In this paper, we propose a novel Multi-Referenced Guided Instance Segmentation (MR-GIS) framework for the challenging problem of semi-supervised video instance segmentation. Our proposed method consists of two passes of segmentation with mask guidance. First, we quickly propagate an initial mask to all frames in a sequence to create an initial segmentation result for the instance. Second, we re-propagate masks with reference to multiple extra samples: high-confidence, reliable frames, namely Reliable Extra Samples, which we keep in a memory pool for reference. To enhance the consistency of instance masks across frames, we search for mask anomalies in consecutive frames and correct them. Our proposed MR-GIS achieves 76.5, 82.1, and 79.3 in terms of region similarity (J), contour accuracy (F), and global score, respectively, on the DAVIS 2020 Challenge dataset, ranking 4th in the semi-supervised track of the challenge.
@inproceedings{Tran2020MultiRefSeg, title = {Multi-Referenced Guided Instance Segmentation Framework for Semi-supervised Video Instance Segmentation}, author = {Tran, Minh-Triet and Hoang, Trung-Hieu and Nguyen, Tam V. and Le, Trung-Nghia and Nguyen, E-Ro and Le, Minh-Quan and Nguyen-Dinh, Hoang-Phuc and Hoang, Xuan-Nhat and Do, Minh N.}, booktitle = {CVPR Workshop on DAVIS Challenge on Video Object Segmentation}, year = {2020}, note = {(4th place)}, presentation = {https://youtu.be/L234FE5uVsc}, leaderboard = {https://davischallenge.org/challenge2020/leaderboards.html}, project_page = {https://sites.google.com/view/ltnghia/research/vos} } - CVPRWiTASK: Intelligent Traffic Analysis Software KitMinh-Triet Tran, Tam V. Nguyen, Trung-Hieu Hoang, Trung-Nghia Le, Khac-Tuan Nguyen, and 22 more authorsIn CVPR Workshop on AI City Challenge, Jul 2020(10th place on Track 1, 26th on Track 2, 5th on Track 4)
Traffic flow analysis is essential for intelligent transportation systems. In this paper, we introduce our Intelligent Traffic Analysis Software Kit (iTASK) to tackle three challenging problems: vehicle flow counting, vehicle re-identification, and abnormal event detection. For the first problem, we propose to track, in real time, vehicles moving along the desired direction in corresponding motions-of-interest (MOIs). For the second problem, we consider each vehicle as a document with multiple semantic words (i.e., vehicle attributes) and transform the given problem into classical document retrieval. For the last problem, we propose forward and backward refinement of anomaly detection, using GAN-based future prediction and backward tracking of completely stalled vehicles or sudden direction changes, respectively. Experiments on the traffic flow analysis datasets from AI City Challenge 2020 show our competitive results, namely, an S1 score of 0.8297 for vehicle flow counting in Track 1, an mAP score of 0.3882 for vehicle re-identification in Track 2, and an S4 score of 0.9059 for anomaly detection in Track 4.
@inproceedings{Tran2020iTASK, title = {iTASK: Intelligent Traffic Analysis Software Kit}, author = {Tran, Minh-Triet and Nguyen, Tam V. and Hoang, Trung-Hieu and Le, Trung-Nghia and Nguyen, Khac-Tuan and Dinh, Dat-Thanh and Nguyen, Thanh-An and Nguyen, Hai-Dang and Nguyen, Trong-Tung and Hoang, Xuan-Nhat and Vo-Ho, Viet-Khoa and Do, Trong-Le and Nguyen, Lam and Le, Minh-Quan and Nguyen-Dinh, Hoang-Phuc and Pham, Trong-Thang and Nguyen, Xuan-Vy and Nguyen, E-Ro and Tran, Quoc-Cuong and Tran, Hung and Dao, Hieu and Tran, Mai-Khiem and Nguyen, Quang-Thuc and Vu-Le, The-Anh and Nguyen, Tien-Phat and Diep, Gia-Han and Do, Minh N.}, booktitle = {CVPR Workshop on AI City Challenge}, year = {2020}, note = {(10th place on Track 1, 26th on Track 2, 5th on Track 4)}, track1 = {https://sites.google.com/view/ltnghia/research/vehicle_flow_counting}, track2 = {https://sites.google.com/view/ltnghia/research/vehicle_reid}, track4 = {https://sites.google.com/view/ltnghia/research/anomaly_detection} } - IVAttention R-CNN for Accident DetectionTrung-Nghia Le, Akihiro Sugimoto, Shintaro Ono, and Hiroshi KawasakiIn Intelligent Vehicles Symposium (IV), Jul 2020(B Rank)
This paper addresses accident detection, where we not only detect objects with classes but also recognize their characteristic properties. More specifically, we aim at simultaneously detecting object class bounding boxes on roads and recognizing their status, such as safe, dangerous, or crashed. To achieve this goal, we construct a new dataset and propose a baseline method for benchmarking the task of accident detection. We design an accident detection network, called Attention R-CNN, which consists of two streams: one for object detection with classes and one for characteristic property computation. As an attention mechanism capturing contextual information in the scene, we integrate global contexts exploited from the scene into the stream for object detection. This attention mechanism enables us to recognize object characteristic properties. Extensive experiments on the newly constructed dataset demonstrate the effectiveness of our proposed network.
@inproceedings{Le2020AttentionRCNN, title = {Attention R-CNN for Accident Detection}, author = {Le, Trung-Nghia and Sugimoto, Akihiro and Ono, Shintaro and Kawasaki, Hiroshi}, booktitle = {Intelligent Vehicles Symposium (IV)}, year = {2020}, note = {(B Rank)}, presentation = {https://youtu.be/SASAeILzJ58}, project_page = {https://sites.google.com/view/ltnghia/research/accident-detection} } - WACVToward Interactive Self-Annotation For Video Object Bounding Box: Recurrent Self-Learning And Hierarchical Annotation Based FrameworkTrung-Nghia Le, Akihiro Sugimoto, Shintaro Ono, and Hiroshi KawasakiIn Winter Conference on Applications of Computer Vision (WACV), Jul 2020(A Rank)
The amount and variety of training data drastically affect the performance of CNNs. Thus, annotation methods are becoming increasingly critical for collecting data efficiently. In this paper, we propose a simple yet efficient Interactive Self-Annotation framework to cut down both the time and the human labor cost of video object bounding box annotation. Our method is based on recurrent self-supervised learning and consists of two processes: an automatic process and an interactive process, where the automatic process aims to build a supporting detector to speed up the interactive process. In the Automatic Recurrent Annotation, we let an off-the-shelf detector watch unlabeled videos repeatedly to reinforce itself automatically. At each iteration, we utilize the trained model from the previous iteration to generate better pseudo ground-truth bounding boxes than those of the previous iteration, recurrently improving the self-supervised training of the detector. In the Interactive Recurrent Annotation, we tackle the human-in-the-loop annotation scenario where the detector receives feedback from the human annotator. To this end, we propose a novel Hierarchical Correction module, where the annotated frame distance decreases in a binary fashion at each time step, to utilize the strength of the CNN on neighboring frames. Experimental results on various video datasets demonstrate the advantages of the proposed framework in generating high-quality annotations while reducing annotation time and human labor costs.
@inproceedings{Le2020InteractiveAnnotation, title = {Toward Interactive Self-Annotation For Video Object Bounding Box: Recurrent Self-Learning And Hierarchical Annotation Based Framework}, author = {Le, Trung-Nghia and Sugimoto, Akihiro and Ono, Shintaro and Kawasaki, Hiroshi}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2020}, note = {(A Rank)}, presentation = {https://youtu.be/daL0rFnfpN0}, project_page = {https://sites.google.com/view/ltnghia/research/video-self-annotation} } - ITS JapanLearning-Based Semi-Automatic Annotation and Accident Detection from Driving Video (in Japanese)Trung-Nghia Le, Shintaro Ono, Akihiro Sugimoto, and Hiroshi KawasakiIn 18th ITS Symposium, Japan, Jul 2020
@inproceedings{Le2020SemiAutoAnnotationJP, title = {Learning-Based Semi-Automatic Annotation and Accident Detection from Driving Video (in Japanese)}, author = {Le, Trung-Nghia and Ono, Shintaro and Sugimoto, Akihiro and Kawasaki, Hiroshi}, booktitle = {18th ITS Symposium, Japan}, year = {2020}, project_page = {https://sites.google.com/view/ltnghia/research/accident-detection} }
2019
- CVIUAnabranch Network for Camouflaged Object SegmentationTrung-Nghia Le, Tam V. Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro SugimotoComputer Vision and Image Understanding (CVIU), Jul 2019(Q1, IF = 4.5 in 2024)
Camouflaged objects attempt to conceal their texture within the background, and discriminating them from the background is hard even for human beings. The main objective of this paper is to explore the camouflaged object segmentation problem, namely, segmenting the camouflaged object(s) in a given image. This problem has not been well studied in spite of a wide range of potential applications, including the preservation of wild animals and the discovery of new species, surveillance systems, and search-and-rescue missions in the event of natural disasters such as earthquakes, floods, or hurricanes. To address this new and challenging problem, we provide a new image dataset of camouflaged objects for benchmarking purposes. In addition, we propose a general end-to-end network, called the Anabranch Network, that leverages both classification and segmentation tasks. Different from existing segmentation networks, our proposed network possesses a second branch for classification that predicts the probability of an image containing camouflaged object(s); this prediction is then fused into the main segmentation branch to boost the segmentation accuracy. Extensive experiments conducted on the newly built dataset demonstrate the effectiveness of our network using various fully convolutional networks.
@article{Le2019AnabranchNet, title = {Anabranch Network for Camouflaged Object Segmentation}, author = {Le, Trung-Nghia and Nguyen, Tam V. and Nie, Zhongliang and Tran, Minh-Triet and Sugimoto, Akihiro}, journal = {Computer Vision and Image Understanding (CVIU)}, year = {2019}, note = {(Q1, IF = 4.5 in 2024)}, project_page = {https://sites.google.com/view/ltnghia/research/camo} } - CVPRWGuided Instance Segmentation Framework for Semi-Supervised Video Instance SegmentationMinh-Triet Tran, Trung-Nghia Le, Tam V. Nguyen, Vinh Ton-That, Trung-Hieu Hoang, and 6 more authorsIn CVPR Workshop on DAVIS Challenge on Video Object Segmentation, Jul 2019(3rd place)
In this paper, we propose a novel Guided Instance Segmentation (GIS) framework to tackle the challenging problem of semi-supervised video instance segmentation. To improve the accuracy of instance segmentation, we propose to perform fine-grained segmentation on a non-rectangular region of interest (ROI). The natural-shaped ROI is generated by applying guided attention from the neighboring frames of the current one. In this way, our method can reduce the ambiguity in the segmentation of different instances, especially those of the same category, within a regular rectangular region. GIS first performs the normal forward mask propagation as in other instance segmentation methods. Then, backward mask propagation is executed to further restore missing instance fragments. This idea is motivated by the scenario where an instance reappears in a video sequence: it is initially small due to the far distance and then gradually increases in size. The re-appeared instance can be detected and segmented once it is large enough. By using mask back-propagation, GIS can restore small instance fragments from before the instance became large enough for detection and segmentation. Our proposed GIS achieved 0.724, 0.784, and 0.754 in terms of region similarity (J), contour accuracy (F), and global score, respectively, on the DAVIS 2019 Challenge dataset, ranking 3rd in the challenge. Our method achieved the best Decay scores across all metrics.
@inproceedings{Tran2019GuidedSeg, title = {Guided Instance Segmentation Framework for Semi-Supervised Video Instance Segmentation}, author = {Tran, Minh-Triet and Le, Trung-Nghia and Nguyen, Tam V. and Ton-That, Vinh and Hoang, Trung-Hieu and Bui, Ngoc-Minh and Do, Trong-Le and Luong, Quoc-An and Nguyen, Vinh-Tiep and Duong, Duc Anh and Do, Minh N.}, booktitle = {CVPR Workshop on DAVIS Challenge on Video Object Segmentation}, year = {2019}, note = {(3rd place)}, leaderboard = {https://davischallenge.org/challenge2019/leaderboards.html}, project_page = {https://sites.google.com/view/ltnghia/research/vos} } - CVPRWVehicle Re-identification with Learned Representation and Spatial Verification and Abnormality Detection with Multi-Adaptive Vehicle Detectors for Traffic Video AnalysisKhac-Tuan Nguyen, Trung-Hieu Hoang, Minh-Triet Tran, Trung-Nghia Le, Ngoc-Minh Bui, and 8 more authorsIn CVPR Workshop on AI City Challenge, Jul 2019(8th place on Track 3 and 25th place on Track 2)
Traffic flow analysis is essential for intelligent transportation systems. In this paper, we propose methods for two challenging problems in traffic flow analysis: vehicle re-identification and abnormal event detection. For the first problem, we propose to combine learned high-level features for vehicle instance representation with hand-crafted local features for spatial verification. For the second problem, we propose to use multiple adaptive vehicle detectors for anomaly proposal and use heuristic properties extracted from anomaly proposals to determine anomaly events. Experiments on the traffic flow analysis datasets from AI City Challenge 2019 show that our methods achieve an mAP of 0.4008 for vehicle re-identification in Track 2 and can detect abnormal events with very high accuracy (F1 = 0.9429) in Track 3.
@inproceedings{Nguyen2019VehicleReID, title = {Vehicle Re-identification with Learned Representation and Spatial Verification and Abnormality Detection with Multi-Adaptive Vehicle Detectors for Traffic Video Analysis}, author = {Nguyen, Khac-Tuan and Hoang, Trung-Hieu and Tran, Minh-Triet and Le, Trung-Nghia and Bui, Ngoc-Minh and Do, Trong-Le and Vo-Ho, Viet-Khoa and Luong, Quoc-An and Tran, Mai-Khiem and Nguyen, Thanh-An and Truong, Thanh-Dat and Nguyen, Vinh-Tiep and Do, Minh N.}, booktitle = {CVPR Workshop on AI City Challenge}, year = {2019}, note = {(8th place on Track 3 and 25th place on Track 2)}, track2 = {https://sites.google.com/view/ltnghia/research/vehicle_reid}, track3 = {https://sites.google.com/view/ltnghia/research/anomaly_detection} } - WACVSemantic Instance Meets Salient Object: Study on Video Semantic Salient Instance SegmentationTrung-Nghia Le, and Akihiro SugimotoIn Winter Conference on Applications of Computer Vision (WACV), Jul 2019(A Rank)
Focusing only on semantic instances that are salient in a scene benefits robot navigation and self-driving cars more than looking at all objects in the whole scene. This paper pushes the envelope on decomposing salient regions in a video into semantically meaningful components, namely, semantic salient instances. We provide a baseline for the new task of video semantic salient instance segmentation (VSSIS), namely the Semantic Instance - Salient Object (SISO) framework. The SISO framework is simple yet efficient, leveraging the advantages of two different segmentation tasks, i.e., semantic instance segmentation and salient object segmentation, and eventually fusing them for the final result. In SISO, we introduce a sequential fusion that looks at overlapping pixels between semantic instances and salient regions to obtain non-overlapping instances one by one. We also introduce a recurrent instance propagation to refine the shapes and semantic meanings of instances, and an identity tracking to maintain both the identity and the semantic meaning of instances over the entire video. Experimental results demonstrate the effectiveness of our SISO baseline, which can handle occlusions in videos. In addition, to tackle the task of VSSIS, we augment the DAVIS-2017 benchmark dataset by assigning semantic ground truth to salient instance labels, obtaining the SEmantic Salient Instance Video (SESIV) dataset. Our SESIV dataset consists of 84 high-quality video sequences with per-frame, pixel-wise ground-truth labels.
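The sequential fusion step described above can be pictured with a small NumPy sketch: given instance masks (ordered by confidence) and a salient-region mask, keep instances that overlap the salient region sufficiently and claim their pixels one by one so the kept instances do not overlap. The overlap threshold and the greedy ordering are illustrative assumptions, not the SISO implementation.

```python
import numpy as np

def sequential_fusion(instance_masks, saliency_mask, overlap_thresh=0.5):
    """instance_masks: list of binary HxW arrays sorted by confidence; saliency_mask: binary HxW array."""
    occupied = np.zeros_like(saliency_mask, dtype=bool)
    salient_instances = []
    for mask in instance_masks:
        mask = mask.astype(bool) & ~occupied                          # drop pixels already claimed
        overlap = (mask & saliency_mask.astype(bool)).sum() / max(mask.sum(), 1)
        if overlap >= overlap_thresh:                                 # keep instances that are mostly salient
            salient_instances.append(mask)
            occupied |= mask
    return salient_instances
```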
@inproceedings{Le2019SemanticSalientInstance, title = {Semantic Instance Meets Salient Object: Study on Video Semantic Salient Instance Segmentation}, author = {Le, Trung-Nghia and Sugimoto, Akihiro}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2019}, note = {(A Rank)}, project_page = {https://sites.google.com/view/ltnghia/research/sesiv} }
2018
- IEEE T-IPVideo Salient Object Detection Using Spatiotemporal Deep FeaturesTrung-Nghia Le, and Akihiro SugimotoIEEE Transactions on Image Processing (T-IP), Jul 2018(Q1, IF = 10.6 in 2022)
This paper presents a method for detecting salient objects in videos where temporal information, in addition to spatial information, is fully taken into account. Following recent reports on the advantage of deep features over conventional hand-crafted features, we propose a new set of SpatioTemporal Deep (STD) features that utilize local and global contexts over frames. We also propose a new SpatioTemporal Conditional Random Field (STCRF) to compute saliency from STD features. STCRF is our extension of CRF to the temporal domain and describes the relationships among neighboring regions both within a frame and over frames. STCRF leads to temporally consistent saliency maps over frames, contributing to the accurate detection of salient objects’ boundaries and noise reduction during detection. Our proposed method first segments an input video into multiple scales and then computes a saliency map at each scale level using STD features with STCRF. The final saliency map is computed by fusing the saliency maps at different scale levels. Our experiments, using publicly available benchmark datasets, confirm that the proposed method significantly outperforms state-of-the-art methods. We also applied our saliency computation to the video object segmentation task, showing that our method outperforms existing video object segmentation methods.
@article{Le2018VideoSOD, title = {Video Salient Object Detection Using Spatiotemporal Deep Features}, author = {Le, Trung-Nghia and Sugimoto, Akihiro}, journal = {IEEE Transactions on Image Processing (T-IP)}, year = {2018}, note = {(Q1, IF = 10.6 in 2022)}, project_page = {https://drive.google.com/open?id=1-aZVmRa07TlxzM2kcGc_6Zxh_Elsa9Od} } - CVPRWContext-based Instance Segmentation in Video SequenceMinh-Triet Tran, Vinh Ton-That, Trung-Nghia Le, Khac-Tuan Nguyen, Tu V. Ninh, and 4 more authorsIn CVPR Workshop on DAVIS Challenge on Video Object Segmentation, Jul 2018(6th place)
In this work, we propose Context-based Instance Segmentation for video object segmentation in two passes. In the first pass, we estimate the main properties of each instance (i.e., human/non-human, rigid/deformable, known/unknown category) by propagating its initial mask to other frames; we employ Instance Re-Identification Flow in this pass. The result of the first pass helps our system automatically select the appropriate scheme for instance segmentation in the second pass. In the second pass, we process human and non-human instances separately. For a human instance, we employ Mask R-CNN to extract human segments, OpenPose to merge fragments within a frame, and object flow to correct and refine the result across frames. For a non-human instance, if the instance has a wide variation in its appearance and belongs to known categories (which can be inferred from the initial mask), we use Mask R-CNN for instance segmentation. If the instance is nearly rigid, we synthesize images from the first frame of the video sequence. We use affine and non-rigid deformations, together with illumination changes, to generate variants of the initial mask. To choose appropriate backgrounds for the synthesized images, we retrieve images from the Places365 dataset with a similar scene category and scene attributes to the original frame. FCNs, including DeepLab2 and OSVOS, are trained on our synthesized dataset for each instance. For a deformable object in an unknown category, we reuse the baseline result from the first pass. Finally, instances in each frame are merged based on their depth values, using DCNF-FCSP, together with human and non-human object interaction and rare object priority.
@inproceedings{Tran2018ContextSeg, title = {Context-based Instance Segmentation in Video Sequence}, author = {Tran, Minh-Triet and Ton-That, Vinh and Le, Trung-Nghia and Nguyen, Khac-Tuan and Ninh, Tu V. and Le, Tu-Khiem and Nguyen, Vinh-Tiep and Nguyen, Tam V. and Do, Minh N.}, booktitle = {CVPR Workshop on DAVIS Challenge on Video Object Segmentation}, year = {2018}, note = {(6th place)}, leaderboard = {https://davischallenge.org/challenge2018/leaderboard.html}, project_page = {https://sites.google.com/view/ltnghia/research/vos} } - WACVBalancing Content and Style with Two-Stream FCNs for Style TransferDuc Minh Vo, Trung-Nghia Le, and Akihiro SugimotoIn Winter Conference on Applications of Computer Vision (WACV), Jul 2018(A Rank)
Style transfer renders given image contents in given styles, and it plays an important role in both fundamental computer vision research and industrial applications. Following the success of deep learning-based approaches, this problem has been revisited very recently, but it remains a difficult task because of the trade-off between preserving content and faithfully rendering style. In this paper, we propose end-to-end two-stream Fully Convolutional Networks (FCNs) aiming at balancing the contributions of the content and the style in rendered images. Our proposed network consists of encoder and decoder parts. The encoder part uses one FCN for content and one FCN for style, where the two FCNs are independently trained to preserve the semantic content and to learn a faithful style representation, respectively. The semantic content feature and the style representation feature are then concatenated adaptively and fed into the decoder to generate style-transferred (stylized) images. To train our proposed network, we employ a loss network, the pre-trained VGG-16, to compute content loss and style loss, both of which are efficiently used for the feature concatenation. Our intensive experiments show that our proposed model generates stylized images with a better balance between content and style than state-of-the-art methods. Moreover, our proposed network achieves efficiency in speed.
@inproceedings{Vo2018TwoStreamStyleTransfer, title = {Balancing Content and Style with Two-Stream FCNs for Style Transfer}, author = {Vo, Duc Minh and Le, Trung-Nghia and Sugimoto, Akihiro}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2018}, note = {(A Rank)}, project_page = {https://sites.google.com/view/ltnghia/research/style_transfer} } - HCMUSInstance Segmentation in Video with Human-Pose Guidance and Data Augmentation (in Vietnamese)Minh-Triet Tran, Tu V. Ninh, Tu-Khiem Le, Vinh Ton-That, Khac-Tuan Nguyen, and 2 more authorsIn Scientific Conference of University of Science, VNU-HCM, Vietnam, Jul 2018
https://drive.google.com/open?id=1S-nzUGfllEHJPXnSuFQf6b9B8cX1eehc
@inproceedings{Tran2018PoseGuidedSegmentationVN, title = {Instance Segmentation in Video with Human-Pose Guidance and Data Augmentation (in Vietnamese)}, author = {Tran, Minh-Triet and Ninh, Tu V. and Le, Tu-Khiem and Ton-That, Vinh and Nguyen, Khac-Tuan and Le, Trung-Nghia and Nguyen, Tam V.}, booktitle = {Scientific Conference of University of Science, VNU-HCM, Vietnam}, year = {2018}, project_page = {https://sites.google.com/view/ltnghia/research/vos} }
2017
- BMVCDeeply Supervised 3D Recurrent FCN for Salient Object Detection in VideosTrung-Nghia Le, and Akihiro SugimotoIn British Machine Vision Conference (BMVC), Jul 2017(A Rank)
This paper presents a novel end-to-end 3D fully convolutional network for salient object detection in videos. The proposed network uses 3D filters in the spatiotemporal domain to directly learn both spatial and temporal information, producing 3D deep features, and transfers these features to pixel-level saliency prediction, outputting saliency voxels. In our network, we combine refinement at each layer with deep supervision to efficiently and accurately detect salient object boundaries. The refinement module recurrently enhances the feature map to incorporate contextual information. Applying deeply supervised learning to hidden layers, on the other hand, improves the details of the intermediate saliency voxels, so the saliency voxel is progressively refined to become finer and finer. Intensive experiments using publicly available benchmark datasets confirm that our network outperforms state-of-the-art methods. The proposed saliency model also works effectively for video object segmentation.
@inproceedings{Le2017Deep3DRFCN, title = {Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos}, author = {Le, Trung-Nghia and Sugimoto, Akihiro}, booktitle = {British Machine Vision Conference (BMVC)}, year = {2017}, note = {(A Rank)}, results = {https://drive.google.com/open?id=1-aZVmRa07TlxzM2kcGc_6Zxh_Elsa9Od}, project_page = {https://sites.google.com/view/ltnghia/research/3d_saliency} } - CVPRWInstance Re-Identification Flow for Video Object SegmentationTrung-Nghia Le, Khac-Tuan Nguyen, Manh-Hung Nguyen-Phan, That-Vinh Ton, Toan-Anh Nguyen, and 7 more authorsIn CVPR Workshop on DAVIS Challenge on Video Object Segmentation, Jul 2017(3rd place)
In this work, we propose an Instance Re-Identification Flow (IRIF) for video object segmentation. For the instance re-identification task, we focus on two main categories: human and non-human object instances. We track each instance and detect it when it re-appears to determine its corresponding bounding box in video frames. When a non-human object re-appears, we use a list of recent SVM classifiers to segment that object. Otherwise, we use the Pyramid Scene Parsing (PSP) Network to automatically segment the person as an initial mask to continue mask propagation. In particular, we use the object detector Faster R-CNN to detect persons and extract person attributes as key features for both tracking and re-identification. In addition, DeepFlow and the Deformable Part Model (DPM) are utilized to track and detect non-human objects. Regarding object segmentation, we adopt multi-SVM classifiers embedding history references with several unary components, namely, saliency, CNN features, location, and color, to segment each object instance within its possible bounding box in each frame. Note that we also estimate the z-order of each instance to enhance later instance tracking and mask propagation. Boundary snapping is adopted to further refine instance shapes. Finally, our IRIF method achieves very competitive results in the DAVIS Challenge 2017, namely, 0.615, 0.662, and 0.638 in terms of region similarity (Jaccard index), contour accuracy (F-measure), and global score, respectively.
@inproceedings{Le2017InstanceReIDFlow, title = {Instance Re-Identification Flow for Video Object Segmentation}, author = {Le, Trung-Nghia and Nguyen, Khac-Tuan and Nguyen-Phan, Manh-Hung and Ton, That-Vinh and Nguyen, Toan-Anh and Trinh, Xuan-Son and Dinh, Quang-Hieu and Nguyen, Vinh-Tiep and Duong, Anh-Duc and Sugimoto, Akihiro and Nguyen, Tam V. and Tran, Minh-Triet}, booktitle = {CVPR Workshop on DAVIS Challenge on Video Object Segmentation}, year = {2017}, note = {(3rd place)}, leaderboard = {https://davischallenge.org/challenge2017/leaderboard.html}, project_page = {https://sites.google.com/view/ltnghia/research/vos} } - ICMEWSpatiotemporal Utilization of Deep Features for Video Saliency DetectionTrung-Nghia Le, and Akihiro SugimotoIn ICME Workshop on Deep Learning for Intelligent Multimedia Analytics (DeLIMMA), Jul 2017(Oral presentation)
This paper presents a method for detecting salient objects in a video where temporal information, in addition to spatial information, is fully taken into account. Following recent reports on the advantage of deep features over conventional hand-crafted features, we propose the SpatioTemporal deep Feature (STF feature), which utilizes local and global contexts over frames. With this feature, we compute the saliency map for each frame through supervised learning with a Random Forest. We then refine the saliency maps using our proposed SpatioTemporal Conditional Random Field (STCRF). STCRF is our extension of CRF toward the temporal domain and formulates the relationship between neighboring regions both within a frame and over frames. STCRF leads to temporally consistent saliency maps over frames, contributing to accurate detection of salient object boundaries and to noise reduction. Our intensive experiments using publicly available benchmark datasets confirm that our proposed method significantly outperforms state-of-the-art methods.
@inproceedings{Le2017SpatiotemporalSaliency, title = {Spatiotemporal Utilization of Deep Features for Video Saliency Detection}, author = {Le, Trung-Nghia and Sugimoto, Akihiro}, booktitle = {ICME Workshop on Deep Learning for Intelligent Multimedia Analytics (DeLIMMA)}, year = {2017}, note = {(Oral presentation)}, demo = {https://www.youtube.com/watch?v=v9abjOlGdVo} } - Region-Based Multiscale Spatiotemporal Saliency for VideoTrung-Nghia Le, and Akihiro SugimotoarXiv preprint arXiv:1708.01589, Jul 2017
Detecting salient objects in a video requires exploiting both the spatial and temporal knowledge contained in the video. We propose a novel region-based multiscale spatiotemporal saliency detection method for videos, where static features and dynamic features computed from the low and middle levels are combined. Our method utilizes these combined features spatially over each frame and, at the same time, temporally across frames using the consistency between consecutive frames. Saliency cues in our method are analyzed through a multiscale segmentation model and fused across scale levels, allowing regions to be explored efficiently. An adaptive temporal window using motion information is also developed to combine the saliency values of consecutive frames in order to keep temporal consistency across frames. Performance evaluation on several popular benchmark datasets validates that our method outperforms existing state-of-the-art methods.
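As a rough illustration of enforcing temporal consistency across consecutive frames, the sketch below averages per-frame saliency maps over a small temporal window. It uses a fixed, centered window for simplicity; the paper's window is adaptive and motion-guided, so this is only a simplified stand-in with assumed inputs.

```python
import numpy as np

def temporally_smoothed_saliency(saliency_maps, window=3):
    """saliency_maps: list of per-frame 2D arrays; returns maps averaged over a centered temporal window."""
    smoothed = []
    n = len(saliency_maps)
    for t in range(n):
        lo, hi = max(0, t - window // 2), min(n, t + window // 2 + 1)   # clip the window at sequence ends
        smoothed.append(np.mean(np.stack(saliency_maps[lo:hi], axis=0), axis=0))
    return smoothed
```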
2015
- PSIVTContrast Based Hierarchical Spatial-Temporal Saliency for VideoTrung-Nghia Le, and Akihiro SugimotoIn Pacific-Rim Symposium on Image and Video Technology (PSIVT), Jul 2015(B Rank) (Oral presentation)
@inproceedings{Le2015ContrastSaliency, title = {Contrast Based Hierarchical Spatial-Temporal Saliency for Video}, author = {Le, Trung-Nghia and Sugimoto, Akihiro}, booktitle = {Pacific-Rim Symposium on Image and Video Technology (PSIVT)}, year = {2015}, note = {(B Rank) (Oral presentation)}, results = {https://drive.google.com/open?id=1-aZVmRa07TlxzM2kcGc_6Zxh_Elsa9Od} }
2014
- ICARCVEssential Keypoints to Enhance Visual Object Recognition with Saliency-based MetricsTrung-Nghia Le, Yen-Thanh Le, Minh-Triet Tran, and Anh-Duc DuongIn International Conference on Control, Automation, Robotics and Vision (ICARCV), Jul 2014(A Rank) (Oral presentation)
@inproceedings{Le2014EssentialKeypoints, title = {Essential Keypoints to Enhance Visual Object Recognition with Saliency-based Metrics}, author = {Le, Trung-Nghia and Le, Yen-Thanh and Tran, Minh-Triet and Duong, Anh-Duc}, booktitle = {International Conference on Control, Automation, Robotics and Vision (ICARCV)}, year = {2014}, note = {(A Rank) (Oral presentation)}, project_page = {https://sites.google.com/view/ltnghia/research/magic_eyes} } - HCIIApplying Saliency-based Region of Interest Detection in Developing a Collaborative Active Learning System with Augmented RealityTrung-Nghia Le, Yen-Thanh Le, and Minh-Triet TranIn International Conference on Human-Computer Interaction (HCII), Jul 2014
Learning activities need not take place only in traditional physical classrooms but can also be set up in virtual environments. The authors therefore propose a novel augmented reality system for organizing a class that supports real-time collaboration and active interaction between educators and learners. A pre-processing phase is integrated into a visual search engine, the heart of our system, to recognize printed materials with low computational cost and high accuracy. The authors also propose a simple yet efficient visual saliency estimation technique based on regional contrast to quickly filter out low-informative regions in printed materials. This technique not only reduces the unnecessary computational cost of keypoint descriptors but also increases the robustness and accuracy of visual object recognition. Our experimental results show that the whole visual object recognition process can be sped up 19 times and the accuracy can increase by up to 22%. Furthermore, this pre-processing stage is independent of the choice of features and matching model in a general pipeline; therefore, it can be used to boost the performance of existing systems to real-time operation.
@inproceedings{Le2014ActiveLearningAR, title = {Applying Saliency-based Region of Interest Detection in Developing a Collaborative Active Learning System with Augmented Reality}, author = {Le, Trung-Nghia and Le, Yen-Thanh and Tran, Minh-Triet}, booktitle = {International Conference on Human-Computer Interaction (HCII)}, year = {2014}, project_page = {https://sites.google.com/view/ltnghia/research/magic_eyes} }
2012
- IHMSCApplying Fast Planar Object Detection in Multimedia Augmentation for Products with Mobile DevicesQuoc-Minh Bui, Trung-Nghia Le, Vinh-Tiep Nguyen, Minh-Triet Tran, and Anh-Duc DuongIn International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Jul 2012
Texts, images, audio, and video clips about products are important information for customers when shopping. However, customers cannot access such information as soon as they see the products in physical stores. The authors propose a Fast Planar Object Detection in Multimedia Augmentation for Products system that uses mobile devices to provide useful information for customers when they go shopping. The experimental results show the strength of our proposed system in processing and displaying multimedia information on users’ mobile devices in real time. The system can be used as a smart assistant for customers to get extra useful information about products and help them make the best choice for their needs.
@inproceedings{Bui2012PlanarDetection, title = {Applying Fast Planar Object Detection in Multimedia Augmentation for Products with Mobile Devices}, author = {Bui, Quoc-Minh and Le, Trung-Nghia and Nguyen, Vinh-Tiep and Tran, Minh-Triet and Duong, Anh-Duc}, booktitle = {International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)}, year = {2012}, }
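For context on the kind of detector such a system typically builds on, here is a generic planar-object detection sketch using local-feature matching and a RANSAC homography. The abstract does not specify the paper's actual detector or parameters, so the feature type, ratio-test threshold, and match count below are assumptions for illustration only.

```python
# Generic planar-object detection sketch (ORB matching + RANSAC homography);
# all parameters below are illustrative, not the paper's settings.
import cv2
import numpy as np

def detect_planar_object(template_bgr, frame_bgr, min_matches=15):
    """Return the detected quadrilateral of the template in the frame, or None."""
    orb = cv2.ORB_create(nfeatures=1500)
    gray_t = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY)
    gray_f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    kp_t, des_t = orb.detectAndCompute(gray_t, None)
    kp_f, des_f = orb.detectAndCompute(gray_f, None)
    if des_t is None or des_f is None:
        return None

    # Lowe's ratio test on Hamming distances keeps only distinctive matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des_t, des_f, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < min_matches:
        return None

    # Robustly estimate the template-to-frame homography and project the corners.
    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_f[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    h, w = gray_t.shape
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H)
```

The returned quadrilateral is what an augmentation layer would use to anchor product media onto the detected packaging or printed page.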
- SoICT: Augmented Media for Traditional Magazines. Vinh-Tiep Nguyen, Trung-Nghia Le, Quoc-Minh Bui, Minh-Triet Tran, and Anh-Duc Duong. In International Symposium on Information and Communication Technology (SoICT), 2012.
Reading traditional newspapers or magazines is a common way to get the latest information about events or new products. However, these printed materials provide readers with only static information. Readers may want more detailed information about a product in an article, or to watch video clips related to an event in a news story, right at the moment they read it. The authors propose a mobile system that provides extra information and multimedia for readers by applying augmented reality to traditional magazines. A user can enjoy rich multimedia information about a product or news item on his/her mobile device simply by looking at an article in a traditional magazine through the device. The system detects which article on which page of a magazine is being viewed and provides the reader with related information and multimedia objects. An important feature of the proposed system is a lightweight filter that efficiently discards candidate covers or articles that do not visually match the image captured by the mobile device. Experiments show that the proposed system achieves an average accuracy of more than 90% and runs in real time. (A sketch of such a lightweight pre-filter follows this entry.)
@inproceedings{Nguyen2012AugmentedMedia, title = {Augmented Media for Traditional Magazines}, author = {Nguyen, Vinh-Tiep and Le, Trung-Nghia and Bui, Quoc-Minh and Tran, Minh-Triet and Duong, Anh-Duc}, booktitle = {International Symposium on Information and Communication Technology (SoICT)}, year = {2012}, project_page = {https://sites.google.com/view/ltnghia/research/visual_search} }
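The "lightweight filter" in the entry above could, for illustration, be as simple as a global color-histogram comparison that rejects obviously non-matching covers before local-feature matching is attempted. The sketch below shows that shape of a pre-filter; the histogram settings and the number of kept candidates are assumptions, not the paper's values.

```python
# Sketch of a cheap candidate pre-filter: global color-histogram similarity prunes
# magazine pages that clearly do not match the captured photo, so expensive
# keypoint matching only runs on a few survivors. The paper's actual filter is
# not detailed in the abstract; the settings here are illustrative assumptions.
import cv2
import numpy as np

def color_signature(image_bgr, bins=(8, 8, 8)):
    """Normalized 3D BGR histogram used as a compact global descriptor."""
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None, list(bins), [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def prune_candidates(query_bgr, page_images, keep=10):
    """Indices of the `keep` pages most similar to the query by color distribution."""
    q = color_signature(query_bgr)
    scores = [cv2.compareHist(q, color_signature(p), cv2.HISTCMP_CORREL)
              for p in page_images]
    return np.argsort(scores)[::-1][:keep]   # highest correlation first
```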
- PACIS: Smart Shopping Assistant: A Multimedia and Social Media Augmented System With Mobile Devices to Enhance Customers' Experience and Interaction. Vinh-Tiep Nguyen, Trung-Nghia Le, Quoc-Minh Bui, Minh-Triet Tran, and Anh-Duc Duong. In Pacific Asia Conference on Information Systems (PACIS), 2012. (A Rank) (Oral presentation)
Multimedia, social media content, and interaction are common means of attracting customers in shopping. However, these features are not always fully available to customers in physical shopping centers. The authors propose Smart Shopping Assistant, a multimedia and social media augmented system on mobile devices that enhances users' experience and interaction while shopping. Smart Shopping turns a regular mobile device into a special prism through which a customer can enjoy multimedia, obtain useful social media related to a product, give feedback, or act on a product during shopping. The system is specified as a flexible framework that can take advantage of different visual descriptors and web information extraction modules. Experimental results show that Smart Shopping can process and provide augmented data in real time. Smart Shopping can be used to attract more customers and to build an online social community in which customers share their shopping interests.
@inproceedings{Nguyen2012SmartShopping, title = {Smart Shopping Assistant: A Multimedia and Social Media Augmented System With Mobile Devices to Enhance Customers' Experience and Interaction}, author = {Nguyen, Vinh-Tiep and Le, Trung-Nghia and Bui, Quoc-Minh and Tran, Minh-Triet and Duong, Anh-Duc}, booktitle = {Pacific Asia Conference on Information Systems (PACIS)}, year = {2012}, note = {(A Rank) (Oral presentation)}, project_page = {https://sites.google.com/view/ltnghia/research/visual_search} }
- RIVF: Applying Virtual Reality for In-Door Jogging. Trung-Nghia Le, Quoc-Minh Bui, Vinh-Tiep Nguyen, Minh-Triet Tran, and Anh-Duc Duong. In International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012.
We propose an “In-place Virtual Tour” system that generates a smart virtual environment adapting to the user's activity while jogging. The proposed system automatically detects the main region of frames captured by a regular camera, analyzes the foreground difference in the main region of consecutive frames to estimate the user's intensity level, and then renders the virtual scene at an appropriate speed. Unlike existing interactive fitness games, the system does not require any special devices. Experimental results demonstrate the effectiveness and real-time performance of the proposed system. The method can also be applied to develop engaging interactive games for indoor fitness exercises, or to create virtual environments for museums in which visitors stay in place while exploring the exhibits. (A rough sketch of this intensity-estimation idea follows this entry.)
@inproceedings{Le2012IndoorJogging, title = {Applying Virtual Reality for In-Door Jogging}, author = {Le, Trung-Nghia and Bui, Quoc-Minh and Nguyen, Vinh-Tiep and Tran, Minh-Triet and Duong, Anh-Duc}, booktitle = {International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, year = {2012}, }
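As a rough illustration of the intensity-estimation idea in the RIVF entry above, the sketch below differences the central region of consecutive camera frames and converts the resulting motion energy into a virtual-scene speed. The cropping fraction, pixel threshold, and speed mapping are assumptions for illustration, not the paper's parameters.

```python
# Rough sketch: frame-difference the central region of consecutive webcam frames
# and map the motion energy to a playback speed for the virtual tour.
# All constants below are illustrative assumptions.
import cv2

def central_region(frame, fraction=0.6):
    """Crop the central `fraction` of the frame, where the jogger is expected."""
    h, w = frame.shape[:2]
    dh, dw = int(h * (1 - fraction) / 2), int(w * (1 - fraction) / 2)
    return frame[dh:h - dh, dw:w - dw]

def speed_stream(camera_index=0, max_speed=3.0):
    """Yield a suggested virtual-scene speed for every captured frame."""
    cap = cv2.VideoCapture(camera_index)
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return
    prev_gray = cv2.cvtColor(central_region(prev), cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(central_region(frame), cv2.COLOR_BGR2GRAY)
        energy = (cv2.absdiff(gray, prev_gray) > 25).mean()   # fraction of moving pixels
        yield min(max_speed, max_speed * energy / 0.2)        # saturating linear mapping
        prev_gray = gray
    cap.release()
```

In a full system, the yielded speed would drive the renderer of the virtual tour, so more vigorous motion in front of the camera translates into faster progress through the scene.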