Traditional image and video compression methods are designed to preserve the quality perceived by human viewers, so an image or video must be reconstructed before machine analysis. Compression methods oriented towards machine vision make it possible to use the bitstream directly for machine vision tasks, but they have difficulty decoding high-quality images. To bridge the gap between machine vision tasks and signal-level representation, researchers have proposed many human-machine collaborative compression methods. To provide researchers with a comprehensive understanding of this field and promote the development of image and video compression, we present this survey. In this work, we give a problem definition and explore the relationships and application scenarios of different methods. In addition, we provide a comparative analysis of existing methods in terms of compression and machine vision task performance. Finally, we discuss several of the most promising directions for future research.
1 Introduction
In recent years, the data volume of images and videos has grown explosively with the development of the Internet. A large number of images and videos are produced, stored, transmitted and processed. Image and video compression technology therefore plays an essential role in reducing the bandwidth and space needed for data transmission and storage while maintaining visual quality. The traditional aim of image and video compression is to optimize the quality of human visual perception at a given bit rate, keeping the quality of the compressed image or video close to that of the original. To achieve this goal, a series of traditional compression techniques for images and videos have been proposed, such as the discrete cosine transform (DCT), motion compensation, inter-frame prediction, quantization and entropy coding. These technologies have made great progress in the past few decades and have formed a series of standards and specifications, such as JPEG [180], JPEG2000 [133], AVC [191], HEVC [20], VVC [21], AV1 [32], and AVS3 [209]. Collectively, these standards have driven the evolution of image and video storage, transmission, and analysis, adequately addressing human requirements for image and video quality in the digital age. In addition, with the development of deep learning, efficient compression methods based on neural networks have been proposed [15, 143, 144, 157, 169, 2, 53, 52, 35, 7, 218, 31, 101, 60, 12, 79, 153, 36, 102, 168, 109, 119, 107, 208, 99, 23, 150, 6, 199, 88, 42, 172, 173, 181, 151, 94, 77, 145, 175, 3, 163, 210, 95, 192, 87, 58, 17, 201, 96, 1, 134, 135, 97, 63, 137, 71, 78, 138, 198, 40, 43, 22, 187, 118, 212, 108, 156, 110, 194]. These methods also focus primarily on the quality of human visual reconstruction. For machine vision tasks, the image or video must be decoded before machine analysis, which prevents the compression pipeline from efficiently meeting the requirements of machine vision systems.
The rapid development of artificial intelligence has also led to increasingly widespread applications of machine vision across various domains: deep learning models are employed to tackle complex tasks such as image and video classification [74, 73, 126, 141, 193, 84, 47, 140, 139, 203, 91, 65, 92, 25, 127, 61, 154, 130, 66], object detection [216, 67, 69, 106, 57, 159, 152, 44, 112, 113, 183, 124, 70, 81, 116, 100, 146, 115, 215, 83], and object segmentation [149, 68, 103, 13, 19, 28, 182, 204, 147, 29, 179], which means that machines have become important recipients and processors of images and videos. However, decoding high-quality images and videos before machine analysis incurs significant computational costs, while decoding low-quality images and videos may result in poor feature extraction and thus reduce analysis performance. To meet the diverse requirements of machine vision, image and video compression standards for machine vision are continuously being developed and refined, such as CDVS [49] and CDVA [50], which aim to generate compact descriptors to support specific tasks like image and video retrieval and visual search. In addition to standards, the academic community has also proposed a series of image and video feature compression methods [8, 33, 34, 76, 161, 166, 167, 206] to improve the analysis efficiency of machine vision. However, the compressed features cannot be used to reconstruct images or videos that meet human visual demands. Considering the necessity of human-machine collaborative compression, international standardization organizations have established relevant standards for image and video compression technologies. For instance, Moving Picture Experts Group Video Coding for Machines (MPEG VCM) [51] aims to provide efficient video compression and feature extraction techniques to support video data processing and machine vision tasks.
In addition, the JPEG AI standard [11] has been proposed to facilitate the efficient distribution and machine consumption of images. It emphasizes the use of advanced DNN-based image compression methods to surpass the compression efficiency of conventional methods. Beyond these standards, numerous technologies have been proposed to address human-machine collaborative image and video compression. As shown in Figure 1, these methods can be categorized into four types based on the components of the compressed information and their decoding approaches: multi-bitstream independent decoding (MBID) [49, 125, 18, 162, 148, 90, 24, 120], multi-bitstream hierarchical decoding (MBHD) [9, 111, 185, 131, 184, 80, 205, 30, 5, 72, 190, 104, 165, 56, 121, 54, 202, 196, 122, 37, 213, 55, 105, 38, 195, 14], single-bitstream multi-head decoding (SBMD) [11, 123, 26, 176], and single-bitstream analysis after reconstruction (SBAR) [132, 186, 62, 59]. These methods not only ensure compression efficiency but also take into account the needs of both human and machine vision tasks.
Based on the above works, several surveys have summarized progress in the compression field. Some surveys summarize innovative work in learning-based image and video compression [129, 89, 214]. In [214], Zhang et al. summarize and compare perceptually optimized video compression methods. Other surveys take into account the gap between machine analysis and signal-level reconstruction. Ma et al. [128] provide an overview of joint feature and texture representation frameworks. Dong and Pan [48] summarize the connections between compression and machine vision tasks.
These works provide summaries and outlooks on the field of image and video compression. In recent years, a series of human-machine collaborative encoding methods have been proposed, which can satisfy both high-level and low-level tasks at lower bit rates. On the one hand, these methods address the issue that compression techniques oriented towards human vision are inefficient for machine vision tasks. On the other hand, they largely solve the problem that machine-oriented compression methods have difficulty reconstructing signal-level representations. Therefore, this paper aims to provide a comprehensive overview of human-machine collaborative image and video compression. The main contributions of this paper can be summarized as follows:
We provide a comprehensive review of image and video compression methods that cater to both human visual perception and machine analysis requirements, analyzing the motivations and principles of these methods.
We analyze the performance of the reviewed human-machine collaborative methods on commonly used benchmarks.
We identify some potential challenges and directions in the human-machine collaborative image and video compression domain.
We have made every effort to collect the vast majority of papers related to this field. The rest of this overview is organized as follows: Section 2 defines the problem of human-machine collaborative image and video compression and introduces relevant metrics for human visual perception and machine analysis. Section 3 introduces the categories of human-machine collaborative image compression and provides an analysis of the methods. Section 4 classifies and discusses human-machine collaborative video compression methods. Section 5 provides performance comparisons of these methods. Section 6 discusses remaining challenges and potential research directions and concludes the survey.
2 Foundations of Human-Machine Collaborative Image and Video Compression
2.1 Problem Definition
For human-machine collaborative image and video compression, the central problem is how to balance the quality of human visual reconstruction against the efficiency of machine analysis. Given an image or video x, compression frameworks designed for human recipients primarily aim to minimize the compression bitrate while maintaining high visual quality. Consequently, the optimization objective focused on human visual perception can be articulated as follows:

$$\min \; R + \lambda D(x, \hat{x})$$
where R denotes the number of bits in the bitstream that needs to be transmitted. The bitstream includes the compressed image or video data. Sometimes it also contains network information, such as the network parameters of an Implicit Neural Representation (INR). λ is a balancing parameter, and D measures the distortion between the original image or video x and the reconstructed image or video x̂ obtained through compression.
Beyond assessing the reconstruction quality and compression bitrate of the image or video, a newer trend is to consider the requirements of machine vision tasks. For a given set of N machine vision tasks with corresponding labels Y = {Y1, Y2, …, YN}, we denote F = {F1, F2, …, FN} as the features extracted from x for these tasks and Ŷi as the predicted outcome for task i. We define Li(Ŷi, Yi) as the loss for task i between the label Yi and the prediction Ŷi derived from the features or from the decoded image or video. Considering the varying importance of different tasks, we introduce weighting parameters to define the optimization objective for machine vision tasks as follows:

$$\min \; R + \sum_{i=1}^{N} \omega_i L_i(\hat{Y}_i, Y_i)$$
where ωi is the weight parameter used to balance the significance of task i, and R here denotes the bitrate of the image or video features. By integrating the optimization objectives for human visual reconstruction and machine analysis, we formulate a comprehensive objective function for human-machine collaborative image and video compression, which aims to minimize the bitrate cost together with the losses of the human and machine vision tasks:

$$\min \; R + \lambda D(x, \hat{x}) + \sum_{i=1}^{N} \omega_i L_i(\hat{Y}_i, Y_i)$$
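To make the structure of the combined objective concrete, the following minimal Python sketch evaluates the additive cost R + λD + Σ ωi Li from scalar terms that are assumed to have been computed elsewhere. The function name `joint_objective`, its arguments, and the sample numbers are illustrative assumptions and do not come from any surveyed method.

```python
def joint_objective(rate, distortion, task_losses, task_weights, lam):
    """Return R + lam * D + sum_i w_i * L_i for precomputed scalar terms.

    All inputs are hypothetical placeholders: `rate` stands for R,
    `distortion` for D(x, x_hat), and `task_losses` for the L_i values.
    """
    if len(task_losses) != len(task_weights):
        raise ValueError("each task loss needs exactly one weight")
    # Weighted sum of the machine vision task losses.
    machine_term = sum(w * l for w, l in zip(task_weights, task_losses))
    return rate + lam * distortion + machine_term


# Made-up values: rate 0.5, distortion 2.0, two task losses with equal weights.
cost = joint_objective(rate=0.5, distortion=2.0,
                       task_losses=[1.0, 3.0], task_weights=[0.5, 0.5],
                       lam=0.25)
```

With an empty task list the machine term vanishes and the expression reduces to the human-oriented rate-distortion cost, which is one way to see how the two objectives are nested.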
2.2 Compression Performance Metric
We summarize two primary categories of metrics used to evaluate the performance of compression algorithms: human visual metrics and machine analysis metrics. Together they allow a comprehensive evaluation of the impact of compression on both human viewers and machine vision tasks.
2.2.1 Human Visual Metric
Human visual metrics are designed to measure the quality of a compressed image or video from the perspective of human viewers. These metrics are crucial for ensuring that compressed content remains visually pleasing. The primary metrics include Peak Signal-to-Noise Ratio (PSNR) [75], Structural Similarity Index (SSIM) [188] and Multi-Scale Structural Similarity (MS-SSIM) [189].
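As an illustration of the simplest of these metrics, PSNR is computed from the mean squared error (MSE) as 10·log10(MAX² / MSE), where MAX is the peak pixel value. The sketch below uses only the Python standard library and operates on flat sequences of pixel values; the function name and the default peak value of 255 (8-bit content) are assumptions made for the example.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the two signals.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the original; SSIM and MS-SSIM additionally compare local structure and are not covered by this sketch.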
2.2.2 Machine Analysis Metrics
For machine analysis, metrics are designed to evaluate the performance of machine vision algorithms on specific image and video tasks such as classification, object detection, and object segmentation. For classification, the most widely used metric is classification accuracy. For object detection, precision, recall, F1-Score [200], and Intersection over Union (IoU) [155] are employed to measure both the accuracy of the predictions and the overlap of predicted object boundaries with the ground truth. For segmentation, IoU, Dice Coefficient [158], and Pixel Accuracy measure the accuracy of boundary delineation and the similarity between predicted and true segmentations.
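The overlap-based metrics can be written down in a few lines. The following sketch, using only the Python standard library, computes IoU for two axis-aligned boxes and IoU, Dice coefficient and pixel accuracy for two flat binary masks. The function names and the (x1, y1, x2, y2) box convention are assumptions chosen for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_scores(pred, truth):
    """Return (IoU, Dice, pixel accuracy) for two flat binary masks."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("masks must be non-empty and of equal length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = len(pred) - tp - fp - fn
    # Two empty masks agree perfectly, so both overlap scores are set to 1.
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    accuracy = (tp + tn) / len(pred)
    return iou, dice, accuracy
```

Dice weights the true positives twice, so for the same prediction it is never lower than IoU.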
3 Human-Machine Collaborative Image Compression
Numerous methods have been proposed to obtain compact representations that support both pixel-level reconstruction and semantic analysis. As mentioned in Section 1, these methods fall into four categories: MBID, MBHD, SBMD, and SBAR. They are discussed in detail in the following subsections. Table 1 provides a comprehensive summary of them.
3.1 Multi-bitstream independent decoding
In addition to the bitstream used for image reconstruction, MBID methods introduce an additional independent bitstream by extracting features and compressing them to support high-level tasks. Some methods use local image descriptors for machine vision tasks, such as the Scale-Invariant Feature Transform (SIFT) proposed by Lowe [125] and the Speeded Up Robust Features (SURF) introduced by Bay et al. [18]. Other approaches use global image descriptors to summarize high-level image properties for advanced analysis. Sivic and Zisserman [162] address large-scale image search using the bag-of-visual-words (BOV) representation, while Perronnin et al. [148] focus on compressing Fisher vectors to reduce memory usage and accelerate retrieval, aiming to supplant the bag-of-visual-words technique. Additionally, Jégou et al. [90] design a simplified version of the Fisher kernel representation to tackle the challenge of image search on a very large scale. A representative work is Compact Descriptors for Visual Search (CDVS) [49], which extracts and compresses local and global features into an independent bitstream to support efficient mobile visual search tasks (Figure 2).
In addition to methods that extract features using traditional computer vision techniques, several learning-based MBID methods have been proposed in recent years. To support various machine analysis needs across different task scenarios, Liu et al. [120] develop a method to optimize machine vision tasks in the compressed domain. This work avoids complex decoding processes and performs machine vision tasks directly on compressed representations. Gating modules are used to select features, and transformation modules are used to process images. In addition, it employs knowledge distillation to improve accuracy and support multitask processing. Cao et al. [24] introduce an adjustable multitask image compression method that balances human and machine vision needs on resource-constrained devices. By designing CNN compressors with different channel numbers for machine vision and human vision, this method not only ensures reconstruction quality but also improves the performance of machine analysis.
3.2 Multi-bitstream hierarchical decoding
MBID methods add an extra machine vision task stream to the traditional human visual reconstruction stream, which increases the storage burden. To avoid this issue, researchers have developed human-machine collaborative image compression methods that support hierarchical decoding of the stream. Subsets of the compressed stream are used to perform machine vision tasks, and they can be combined with the remaining streams to reconstruct images. Among machine vision tasks, facial tasks are of great significance because of their widespread application in daily life, and some methods are designed specifically to process facial images for facial analysis. We discuss these methods separately. Most other methods are designed for general image analysis tasks: some incorporate semantic information, while others employ adaptive frameworks for machine vision tasks. We discuss them in turn.
3.2.1 Compression Methods for Facial Tasks
An early work [9] extracts features directly from the HEVC-encoded bitstream. This method significantly reduces processing time by skipping traditional decoding steps such as dequantization and inverse transformation. It employs square patches and convolutional networks for face detection, achieving efficient detection speed and accuracy. It is particularly suited to processing static images or the I-frames of encoded videos.
Similarly, several studies design methods to extract facial textures for machine vision tasks related to face recognition. Wang et al. [185] introduce a scalable facial image compression approach that includes a basic layer for feature compression and an enhancement layer for texture reconstruction. This method leverages deep learning models for feature extraction and texture reconstruction. Mao et al. [131] use a StyleGAN-based approach to encode face images in a scalable style, allowing flexible control over image quality and semantic information through multi-layer encoding. This method provides superior visual performance at extremely low bitrates and is suitable for low-resolution facial image applications.

In addition to directly extracting texture features, other methods improve face reconstruction quality by introducing additional information. Wang et al. [184] introduce a framework that contains base and enhancement layers. The base layer extracts features for machine vision tasks and coarse reconstruction. The enhancement layer takes the residuals between the coarse reconstruction and the original image as input to enhance the texture information. The enhanced residuals are used together with the coarse reconstruction to decode the high-quality image. Fang et al. [54] propose a face image compression framework in which the original image is converted into a specially designed sparse color sketch using image-to-image transformation. This transformation helps to reduce the redundancy in the image. The sketch can be used for both machine vision tasks and reconstruction, and the multiscale discriminator of the framework is designed to enhance detail. Hu et al. [80] transform images into edge maps and key reference pixels, which makes the feature representation more compact and reduces the number of encoding bits required. This method meets the requirements of machine vision tasks such as facial landmark detection and can also reconstruct high-quality images. Yang et al. [205] combine generative models and deep learning techniques to achieve ultra-low-bitrate facial image compression. Their framework compresses and transmits highly compact feature vectors, which can be transformed for machine analysis, and mainly supports face segmentation.
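The base-plus-enhancement idea shared by several of the methods above can be illustrated with a toy example. In the sketch below, uniform scalar quantization stands in for the learned base and enhancement codecs, which is purely an assumption made for illustration: the base layer is a coarse version of the signal, the enhancement layer encodes the residual more finely, and the decoder uses either the base alone or both layers. All function names and step sizes are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantization: snap each value to a multiple of `step`."""
    return [round(v / step) * step for v in values]


def encode_layers(pixels, base_step=32, enh_step=4):
    """Return (base layer, enhancement layer) for a flat list of pixel values."""
    base = quantize(pixels, base_step)                # coarse layer for analysis
    residual = [p - b for p, b in zip(pixels, base)]  # what the base layer missed
    enhancement = quantize(residual, enh_step)        # finer coding of the residual
    return base, enhancement


def decode(base, enhancement=None):
    """Decode the base layer alone, or refine it with the enhancement layer."""
    if enhancement is None:
        return list(base)
    return [b + e for b, e in zip(base, enhancement)]


pixels = [10, 100, 200, 37]
base, enhancement = encode_layers(pixels)
coarse = decode(base)             # enough for a (hypothetical) analysis task
fine = decode(base, enhancement)  # refined reconstruction for viewing
```

In this toy setting the reconstruction error of the refined output is bounded by half the enhancement step, while the base layer alone is bounded only by half the base step.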
3.2.2 Semantic Information Based Compression Methods
Facial image analysis is only one part of machine vision, and most methods aim to meet the machine analysis requirements of general images rather than facial images alone. Some researchers have designed frameworks that use semantic segmentation for human-machine collaborative image compression. On the one hand, semantic segmentation maps can be used to enhance image quality. On the other hand, such methods can support machine vision tasks such as object segmentation at a lower bit rate. For example, Akbari et al. [5] propose an image compression framework that uses deep learning and semantic segmentation. The input image and its corresponding segmentation map are used to generate a compact representation from which a coarse reconstruction of the image is obtained. The residuals of the coarse reconstruction are transmitted to enhance the visual quality. Building on this work [5], Hoang et al. [72] introduce a method that enhances image reconstruction quality through semantic segmentation. It uses specially structured neural networks to map deformed semantics back to the original distribution of the semantic segmentation, improving image compression performance. In 2021, Chen et al. [27] propose an end-to-end mutually enhancing network for image compression and semantic segmentation. This method uses traditional image compression algorithms to compress the input image into a low-bit-rate encoded image. Its semantic segmentation module employs advanced semantic segmentation networks to generate a semantic segmentation map, and its enhancement module uses the semantic information extracted from this map to improve image quality. In addition, Feng et al. [56] explore an image compression method based on irregular group decoupling and customized semantic partitions for efficient image reconstruction. This approach supports object detection and instance segmentation. It also allows specific image parts to be encrypted, which enhances data security and compression efficiency.

In addition to directly using the semantic segmentation map for compression, some works extract advanced semantic information for machine analysis and improved reconstruction quality. Tu et al. [177] introduce a cross-layer context model to reduce redundancy and improve compression efficiency. This method takes higher-layer features as cross-layer priors. The compression mechanism is applied only to the regions of interest (ROIs). The generated scalable bitstream can be partially decoded for specific machine vision tasks or fully decoded for human viewing. Chen et al. [30] extract a gray-scale profile to satisfy the demands of machine analysis tasks such as classification, detection, and segmentation. The gray-scale profile and low-level signal features are combined to generate a low-quality image, and the high-quality image is reconstructed from the low-quality image and a residual map. Zhang et al. [213] use a layered generative approach for machine-perception-driven image compression. The method consists of a learning-based layered compression model and a multi-task analysis network. The layered compression model includes an encoder, a decoder, and a probability estimation model. The encoder encodes the input image into a reconstruction part and a semantic part, and a fusion module is used for reconstruction. The multi-task analysis network performs machine vision tasks such as classification and segmentation on the compressed representation.
3.2.3 Other Compression Methods
In addition to the two categories mentioned above, some methods innovate in the hierarchical codec framework itself. Wu et al. [196] propose a task-adaptive network to support image compression for both human vision and machine vision tasks. The training of this network is guided by a teacher network. The quantized latent representation can be used to reconstruct images at different levels through multi-scale decoders. Similarly, Wang et al. [190] propose a two-stage approach that contains a feature-domain analysis network and a preview image generation network. It encodes the input images into quantized analysis-oriented feature maps, which can be used directly by the machine analysis algorithm without reconstructing the RGB images. Feature residuals and feature maps are then combined to reconstruct a high-quality image. Choi and Bajic [38] present a scalable multi-task image compression method. It splits the latent space into a base part and an enhancement part. The base part is used for machine vision tasks, and the full latent space is used for reconstruction. What is transmitted depends on the needs of the downstream tasks.
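The latent-splitting idea can be sketched as follows. This is a simplified illustration rather than the implementation of [38]: it assumes the latent is a list of channels, that the first `base_channels` channels form the base part, and that the sender chooses what to transmit from a single flag. All names are hypothetical.

```python
def split_latent(latent, base_channels):
    """Split a list of latent channels into (base part, enhancement part)."""
    if not 0 < base_channels <= len(latent):
        raise ValueError("base_channels must be between 1 and the channel count")
    return latent[:base_channels], latent[base_channels:]


def payload_for(latent, base_channels, needs_reconstruction):
    """Select the channels to transmit for the requested downstream use."""
    base, enhancement = split_latent(latent, base_channels)
    # Machine analysis only needs the base part; human viewing needs everything.
    return base + enhancement if needs_reconstruction else base


# Hypothetical latent with four channels of two values each.
latent = [[1, 2], [3, 4], [5, 6], [7, 8]]
machine_payload = payload_for(latent, base_channels=1, needs_reconstruction=False)
human_payload = payload_for(latent, base_channels=1, needs_reconstruction=True)
```

Under these assumptions the machine-only payload carries one of four channels, which is where the bitrate saving for analysis-only use would come from.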
In addition, other methods take advantage of different deep learning base models to improve image quality and machine analysis accuracy. Bai et al. [14] encode images into discrete representations and use Transformers for decoding and analysis, including dedicated classifiers and reconstructors. A key advantage of this approach is that it leverages the global information processing capabilities of Transformers. Lei et al. [104] propose a progressive deep image compression (DIC) scheme for image classification and reconstruction. A semantic analysis module classifies the input image, and class activation mapping is used to generate a semantic importance map of the latent vector. A Generative Adversarial Network (GAN) is adopted to improve perceptual quality by matching the reconstructed image to the input image in the statistical domain.
3.3 Single-bitstream multi-head decoding
The previously discussed methods use multi-stream hierarchical decoding to meet multitask requirements. In contrast, some single-stream methods transform the entire stream and use different task decoders to address human and machine vision tasks. Torfason et al. [176] explore a method that uses the compressed representations for machine inference. Instead of decoding the compressed representation into RGB space, the authors integrate the encoders and decoders of DNN-based compression methods with architectures for image understanding. This approach reduces computational cost and allows inference on the compressed representations. Liu et al. [123] propose a versatile framework that integrates image compression and image classification. The goal is to extract a fully shared latent representation that supports both tasks. The framework extracts features and uses a classifier to obtain compact and general shared latent representations. Similarly, Chen et al. [26] propose a method that uses a trained Transformer-based image codec for machine inference without fine-tuning the codec. The method relies on prompting techniques to achieve this transfer: an instance prompt is fed into the encoder and a task prompt is fed into the decoder, so that the decoded image is suitable for machine vision tasks such as object detection.
3.4 Single-bitstream analysis after reconstruction
The aforementioned SBMD frameworks meet machine vision task requirements with multiple decoders. In addition to these frameworks, some methods introduce task-related image information to improve machine analysis performance after image reconstruction.
Mao et al. [132] use a learned facial image compression method based on external prior knowledge. It encodes facial images into sketches and thumbnails and combines them for reconstruction, which improves both the quality of the reconstructed facial images and the analysis performance on them. Wang et al. [186] propose an end-to-end deep image compression framework for machine vision tasks, which uses an inverted bottleneck structure to optimize channel distribution. This structure uses a compact semantic feature representation to optimize rate-accuracy performance. Guo et al. [62] employ content-adaptive and diffusion techniques for image feature compression. This method allows flexible switching between different perceptual quality standards at extremely low bit rates. It uses contrastive learning and pseudo-label techniques, which significantly enhance the perceptual quality and encoding performance of images.
Coding Optimization-based Research: In addition to methods based on coding networks, there are approaches based on coding optimization that improve human-machine collaborative image compression efficiency through adjustable quantization techniques and other optimization schemes. Li et al. [111] design a texture-feature quality index to guide compression. To improve both reconstruction quality and recognition accuracy, they combine the HEVC/H.265 standard for texture encoding with scalar quantization and deep feature entropy coding. Lei et al. [105] design an adaptive image compression method that selects regions of interest (ROIs) based on their semantic importance. The encoder and decoder calculate an ROI gain matrix and an ROI inverse gain matrix to control the quantization accuracy of different latent vector elements. Gao et al. [59] design a multitask image compression method with an optimization strategy based on semantic metrics. By adjusting the quantization steps and distortion measures of the compression network through bit allocation and semantic metrics, it reduces distortion while preserving semantic information. The reconstructed images are suited to various machine analysis tasks.
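The effect of a gain matrix on quantization accuracy can be seen in a small example. The sketch below is a simplified illustration, not the procedure of [105]: it assumes the gain is applied element-wise before integer rounding and that the inverse gain is its reciprocal, so that elements with a larger gain (for example, those inside an ROI) are effectively quantized with a finer step. Function names and values are hypothetical.

```python
def roi_quantize(latent, gain):
    """Scale each latent element by its gain, then round to the nearest integer."""
    return [round(y * g) for y, g in zip(latent, gain)]


def roi_dequantize(symbols, gain):
    """Apply the inverse gain (1 / g) to recover approximate latent values."""
    return [s / g for s, g in zip(symbols, gain)]


# Two latent elements with the same value; only the first lies in the ROI.
latent = [0.3, 0.3]
gain = [4.0, 1.0]  # hypothetical gains: larger gain means finer quantization
symbols = roi_quantize(latent, gain)
recovered = roi_dequantize(symbols, gain)
```

Here the ROI element is recovered with a much smaller error than the background element, at the cost of a larger symbol to encode.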
4 Human-Machine Collaborative Video Compression
Unlike images, video frames are temporally correlated, which makes human-machine collaborative image compression methods inadequate for the compression requirements of videos. To address this problem, researchers have developed several human-machine collaborative video compression methods. Since no human-machine collaborative video compression methods using the SBMD framework have been found yet, the existing methods can be classified into three types: MBID, MBHD, and SBAR. Table 2 provides a comprehensive summary of them.
4.1 Multi-bitstream independent decoding
The CDVA standard is representative of this category. Duan et al. [50] offer a compact and efficient representation of video feature descriptors. It reduces redundancy through keyframe detection and extracts powerful deep learning features using convolutional neural networks (CNNs) combined with Nested Invariance Pooling (NIP). This standard optimizes video structure and reduces computational complexity. Similarly, Zhang et al. [211] use feature-based affine motion compensation to optimize video quality and feature retrieval capabilities. This approach merges video streams and feature data into a single bitstream with robust visual retrieval capabilities, which can support local feature descriptors such as SIFT, SURF, and CDVS. Antonio et al. [10] propose a visual object compression method for smart surveillance applications, in which several autoencoders are adopted to produce a compact latent representation of a specific object class.
4.2 Multi-bitstream hierarchical decoding
This kind of method employs hierarchical compression strategies to adapt dynamically to different decoding requirements. Some hierarchical methods analyze intra-frame information and inter-frame relationships within the stream to support analysis and reconstruction. Choi and Bajic [39] propose a two-layer scalable video compression framework that combines conventional and learning-based video compression techniques. The base layer contains the information related to object detection, and the enhancement layer is designed for high-quality reconstruction. Hadizadeh and Bajic [64] introduce a scalable video compression framework that consists of a base layer and an enhancement layer. In the base layer, the video frames are encoded into a compressed base bitstream. The decoded base frames are used by a computer vision model for video analysis, specifically object detection. The enhancement layer compresses the input frames conditionally to generate a compressed bitstream, and its decoder then reconstructs the output frames for human viewing.

In addition, other hierarchical compression methods improve video compression performance by embedding deep semantic information into the compression process. In 2022, Huang et al. [82] proposed a visual compression framework that consists of three layers. The basic layer compresses the semantic information for machine vision tasks. The enrichment layer focuses on pixel-level information and is used for tasks such as semantic segmentation and human parsing. Key frames are compressed separately. The visual layer uses the decoded content from the basic and enrichment layers to reconstruct high-quality video, which reduces the transmission burden. Huang et al. [85] also proposed a joint end-to-end video compression framework. It extracts semantic information between temporally neighboring frames, which can support both signal reconstruction and machine analysis. In 2023, Lin et al. [114] proposed a scalable video compression framework. It consists of three main components: compact representations, a scalable bitstream, and video compression. This method extracts compact representations from videos, including semantic features, structure features, and texture features. These representations are then compressed into a scalable bitstream. A conditional semantic compression module is designed to reduce the spatial-temporal redundancy of the semantic feature. An interlayer frame prediction module models the interlayer correlation and predicts video frames using the semantic feature. Jin et al. [93] introduced a semantic video compression method that incorporates static and dynamic visual clues into a structured bitstream to support machine vision tasks. By generating a Semantic Structured Bitstream (SSB), this method significantly reduces the cost and complexity of video decompression while enabling direct processing by machine algorithms. Tian et al. [171] employed an autoencoder network that aims to compress videos while preserving the semantic information in an unsupervised manner. The method uses a masked autoencoder to learn a compact representation of the video frames and is trained with a combination of a semantic loss and a non-semantic suppression loss. In 2024, Tian et al. [170] proposed a compression framework that integrates traditional video codecs with neural network-based models and preserves the semantic content during compression. The authors emphasize the importance of task-decoupled design principles, scalable compression, label-free learning schemes, and effective semantic priors in an AI-task-oriented video compression system. The proposed framework incorporates these principles and aims to provide a versatile compression system that supports diverse tasks.
Furthermore, some hierarchical compression methods utilize compression strategies tailored to the characteristics of the video. Xia et al. [197] present a joint compression framework for surveillance scenes, which utilizes a learnable sparse motion pattern to guide the generation of video frames through a deep generative model. This approach reduces the total coding cost of both features and videos. Ahmmed et al. [4] present a collaborative video compression method that utilizes cuboidal partitioning. This technique divides video frames into multiple cuboids to extract and encode features, which reduces the bit requirements and computational complexity. This strategy can meet the requirements of both reconstruction and machine vision tasks such as object detection. Ikusan and Dai [86] introduce an intermediate feature compression framework, which consists of several components including feature extraction, feature selection, rate-distortion optimization, and video encoding. A CNN is used for feature extraction, and hierarchical clustering is utilized to select the most relevant features. The selected features are reconstructed for different machine vision tasks.
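To make the base-layer / enhancement-layer split used by the two-layer frameworks in this subsection more concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the method of any cited paper: uniform scalar quantization stands in for the actual codecs, entropy coding and clipping are omitted, and all function names and step sizes are hypothetical. It only illustrates that the machine-oriented decoder needs the base layer alone, while the human-oriented decoder adds an enhancement layer that codes the residual left by the base layer.

```python
def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer index."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Map integer indices back to reconstruction levels."""
    return [i * step for i in indices]


def encode_layers(frame, base_step=32, enh_step=4):
    """Split a frame (flat list of pixel values) into two layers.

    base_step and enh_step are hypothetical quantization step sizes.
    """
    base_idx = quantize(frame, base_step)
    base_rec = dequantize(base_idx, base_step)
    # The enhancement layer is conditioned on the decoded base layer:
    # it codes only the residual that the base layer failed to capture.
    residual = [x - b for x, b in zip(frame, base_rec)]
    enh_idx = quantize(residual, enh_step)
    return base_idx, enh_idx


def decode_for_machine(base_idx, base_step=32):
    """Machine path: only the base layer is decoded."""
    return dequantize(base_idx, base_step)


def decode_for_human(base_idx, enh_idx, base_step=32, enh_step=4):
    """Human path: base layer plus enhancement residual."""
    base_rec = dequantize(base_idx, base_step)
    enh_rec = dequantize(enh_idx, enh_step)
    return [b + e for b, e in zip(base_rec, enh_rec)]


def max_abs_error(a, b):
    """Largest absolute per-sample difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


frame = [12, 40, 200, 130, 77, 255, 0, 91]
base_idx, enh_idx = encode_layers(frame)
coarse = decode_for_machine(base_idx)
fine = decode_for_human(base_idx, enh_idx)
```

In this toy setting the base-only reconstruction is within half a base step of the input, and adding the enhancement layer tightens the error to half an enhancement step; real systems replace both quantizers with learned or conventional codecs.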
4.3 Single-bitstream multi-head decoding
Similar to the image compression methods, some human-machine collaborative video compression methods are designed with multiple decoding units to cater to various tasks.
Yi et al. [207] introduce a task-driven video compression framework that enhances video quality and compression efficiency through optimized multi-scale motion estimation and multi-frame feature fusion. Moreover, the framework utilizes multitask learning to optimize the encoding process, aiming to balance signal and semantic fidelity. Sheng et al. [160] present VNVC, a multifunctional neural video compression framework that supports various video tasks. The framework includes video reconstruction, enhancement, and analysis modules, using a single bitstream with multiple decoding modules. It partially decodes videos into intermediate features that are directly available for downstream tasks, thereby reducing decoding complexity and enhancing task performance.
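The single-bitstream, multi-head idea can be illustrated with a toy sketch. This is not the VNVC architecture or any other cited method: the quantizer, the "bright/dark" classifier, and all function names and parameters are hypothetical stand-ins. The point is only structural: the bitstream is produced once, partially decoded once into shared features, and those features are then consumed by separate heads for human viewing and for machine analysis.

```python
def encode(frame, step=8):
    """Quantize a flat list of 8-bit pixel values into one shared bitstream."""
    return bytes(min(255, round(v / step)) for v in frame)


def decode_features(bitstream, step=8):
    """Partial decoding: recover intermediate features shared by all heads."""
    return [b * step for b in bitstream]


def reconstruction_head(features):
    """Head for human viewing: clip features to the 8-bit pixel range."""
    return [max(0, min(255, f)) for f in features]


def analysis_head(features, threshold=128):
    """Head for machine analysis: a stand-in brightness classifier."""
    mean = sum(features) / len(features)
    return "bright" if mean >= threshold else "dark"


frame = [250, 240, 10, 200, 180, 220, 30, 190]
bitstream = encode(frame)               # produced once, transmitted once
features = decode_features(bitstream)   # decoded once, reused by both heads
pixels = reconstruction_head(features)  # output for human viewing
label = analysis_head(features)         # output for machine analysis
```

Because the analysis head works directly on the partially decoded features, it never needs the full pixel reconstruction, which is the source of the reduced decoding complexity described above.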
5 Comparative Analysis of Techniques
In this section, we discuss the performance evaluation of human-machine collaborative compression methods. First, we introduce some commonly used reconstruction benchmark databases. Next, we provide a detailed discussion of the reconstruction performance of various methods. Then, we introduce the machine vision benchmark databases that are used for evaluation. Finally, we compare the machine analysis performance of the methods. Some papers provide official project pages and open-source code; we summarize the links in Table 3.
5.1 Image and video compression methods performance
5.1.1 Human Oriented Compression Performance
We first compare the performance of compression frameworks in the image domain. When the recipient is human, the compression method aims to address the rate-distortion optimization problem. There are several commonly used image compression databases such as Kodak [98], CLIC2020 [174], and ImageNet [46]. The Kodak database consists of 24 high-quality images originally provided by Eastman Kodak Company. These images are typically used to test the performance of various image compression techniques because they cover a wide range of real-world scenes and are known for their high quality.
The ImageNet database is a computer vision dataset created by Professor Li of Stanford University. The database contains 14,197,122 images and 21,841 synset indexes. A synset is a node in the WordNet hierarchy and represents a set of synonyms. The ImageNet dataset has long been a benchmark for evaluating the performance of image classification algorithms. Object information and bounding boxes are also provided.
The CLIC2020 database is part of an annual image compression competition. The database includes a variety of images that test the abilities of compression algorithms under real-world conditions. It contains images with varying resolutions and lighting conditions, which makes it well suited to assessing the performance of learning-based image compression methods.
Some papers report the compression performance of their methods on Kodak [105, 195, 56, 27, 72, 5, 38, 30]. The rate-distortion (RD) curves of these methods are shown in Figure 3.
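Each point on an RD curve pairs a rate with a distortion (or quality) value. As a hedged illustration, the sketch below computes the two quantities most commonly plotted for image codecs, bits per pixel (bpp) and PSNR; the individual papers may use other metrics such as MS-SSIM, and the function names and example numbers here are our own assumptions rather than values from any cited work.

```python
import math


def bits_per_pixel(num_bytes, width, height):
    """Rate: size of the compressed file in bits divided by the pixel count."""
    return 8.0 * num_bytes / (width * height)


def psnr(original, reconstructed, max_value=255.0):
    """Quality: peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals, no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical example: a 768x512 image compressed to 49,152 bytes.
rate = bits_per_pixel(49152, 768, 512)
quality = psnr([100, 100], [101, 99])
```

Repeating this measurement at several quality settings of a codec and plotting quality against rate yields one RD curve; a curve that lies higher and further to the left indicates better compression performance.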
Similarly, there are some commonly used datasets in the field of video compression, such as the HEVC test sequences [142] and UVG [136]. The HEVC test sequences are a set of carefully selected video clips specifically designed to evaluate and optimize the performance of HEVC codecs. These sequences cover a wide range of resolutions, from low resolution to ultra-high definition (such as 4K), and include diverse scene types and content, such as motion scenes, natural landscapes, and computer-generated imagery. The HEVC test sequences played a crucial role during the standardization process of HEVC and serve as essential resources for developing and validating new video compression technologies.
The UVG dataset is a widely used resource in the fields of video compression and quality assessment, which is released by the Ultra Video Group at Tampere University in Finland. This dataset provides high-quality test material for evaluating video encoding, decoding, and quality assessment techniques. The UVG dataset features 4K resolution (3840x2160) video clips with diverse content types, including motion scenes, natural landscapes, computer-generated imagery, and animations. The clips are recorded at a high frame rate (120 fps), allowing researchers to assess codec performance and video quality under high frame rate conditions.
Considering that most papers use HEVC class B as the compression test set, we use RD curves to compare the compression performance of these methods in Figure 4.
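For reference, the rate-distortion optimization problem mentioned at the start of this subsection is commonly written in the Lagrangian form below. The notation is an assumption made for illustration and is not taken from the compared papers: R denotes the bitrate, D the distortion between the original and the reconstruction, θ the codec parameters, and λ the trade-off weight.

```latex
\min_{\theta} \; J(\theta) = R(\theta) + \lambda \, D(\theta), \qquad \lambda > 0
```

Under this formulation, each point on an RD curve corresponds to one operating point of the codec, typically obtained with one value of λ for learned codecs or one quantization parameter for conventional codecs.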
5.1.2 Machine Analysis Performance
In addition to supporting lossy reconstruction of images, human-machine collaborative compression methods also support one or more machine vision tasks such as classification, object detection, and object segmentation. Commonly used databases include Caltech101 [16], Pascal VOC 2012 [178], COCO [117], LFW [217], Cityscapes [41], UCF101 [164], and MOT17 [45].
Caltech101 contains images from 101 different object categories. Each category includes about 40 to 800 images. The categories cover various types of animals, objects, and scenes. This database is commonly used for image classification and object recognition tasks.
Pascal VOC 2012 is part of the PASCAL Visual Object Classes Challenge. It is a widely used database for object detection, image segmentation, and classification. It includes images from 20 categories such as animals, vehicles, and household items, with annotations for each of these tasks.
COCO is a large-scale database for object detection, segmentation, and captioning. Most of the images are taken from everyday scenes and natural environments. The database includes label information for object segmentation, object localization, and image captioning.
LFW focuses on face recognition and consists of JPEG images collected over the internet. Each image is labeled with the name of the person pictured. LFW is used for studies in automatic face recognition.
Cityscapes provides a large database of urban street scenes for semantic urban scene understanding. It contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with annotations for semantic urban scene understanding tasks such as segmentation.
UCF101 is a widely used action recognition database. Released by the University of Central Florida, this database contains 13,320 video clips which belong to 101 action categories, such as sports activities, daily actions, and human-object interactions. The videos are collected from YouTube and offer a diverse range of scenes and camera motions, providing a comprehensive benchmark for evaluating action recognition algorithms.
MOT17 is a benchmark dataset widely used for evaluating multi-object tracking algorithms in computer vision. Released as part of the MOTChallenge, it includes a diverse set of video sequences recorded in various challenging real-world scenarios, such as busy streets and public spaces, with multiple pedestrians and vehicles. Each video is annotated with precise bounding boxes and unique identifiers for each object, providing ground truth for tracking performance evaluation.
We select a number of recent human-machine collaborative image and video compression algorithms from different categories and compare their performance on image classification, object detection, and object segmentation. Tables 4, 5 and 6 display the machine vision task performance of the selected methods on the corresponding databases. A “-” indicates that the bitrate information is not provided in the original paper. A blank space indicates that the method does not support the task.
6 Conclusion and Future Directions
The majority of compressed images and videos are ultimately intended for human viewing or machine processing. To meet the requirements of both human visual perception and machine analysis, significant strides have been made in human-machine collaborative compression. This paper presents and synthesizes recent advancements in human-machine collaborative image and video compression methods. These methods not only ensure visual quality for humans but also boost utility for machine vision tasks such as classification, object detection, and object segmentation. We group them into four categories. In addition, we summarize comparative evaluations of some advanced methods on various tasks. However, the existing methods primarily focus on conventional visual tasks for images and videos. They may struggle with other machine vision tasks such as video summarization, object counting, and zero-shot classification. Furthermore, the utilization of large models and prior knowledge might be a potential direction. Based on the current development of image and video compression techniques, we consider the following directions promising for further improving the performance of human-machine collaborative compression methods.
Large models with extensive prior knowledge may be able to further enhance the performance of compression algorithms, particularly in managing complex or low-bitrate images and videos. These models could aid in predicting essential content, optimizing bit allocation, and minimizing visual redundancy.
Cross-modal compression may be better suited to both kinds of recipients, since it can simultaneously handle images/videos, audio, point clouds, and text, which together carry both visual and semantic information. This might boost compression efficiency and enhance functionality in applications such as video captioning and multimedia search. Besides, it could improve the efficiency of intelligent analysis and automated decision making, contributing to the development of industrial applications such as autonomous driving and robotics.
Furthermore, the combination of handcrafted feature representation and deep learning based representation may be able to provide a promising balance between compression performance and generalization for machine vision tasks.
This work was supported by the National Natural Science Foundation under Grant 62071449 and U20A20184, the Fundamental Research Funds for the Central Universities (E2ET1104) and the major project of PCL under Grant PCL2023A08.





