1. Which deep learning model is most commonly used for generating image captions?
(A) Convolutional Neural Network
(B) Recurrent Neural Network
(C) CNN-RNN hybrid model
(D) Support Vector Machine
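A minimal sketch of the CNN-RNN hybrid named in Q1, written in PyTorch. The tiny convolutional encoder, the `embed_size`/`hidden_size` values, and feeding the image feature as the first decoder input are illustrative assumptions, not a reference implementation; production models swap in a pretrained backbone such as ResNet.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """CNN encoder extracts image features; RNN decoder generates words."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        # Toy CNN encoder; in practice a pretrained backbone is used here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_size),
        )
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)    # (B, 1, embed)
        words = self.embed(captions)                 # (B, T, embed)
        inputs = torch.cat([feats, words], dim=1)    # image feature first
        hidden, _ = self.decoder(inputs)
        return self.fc(hidden)                       # (B, T+1, vocab)

model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 1000])
```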
2. Which dataset is widely used for training image captioning models?
(A) ImageNet
(B) COCO
(C) CIFAR-10
(D) PASCAL VOC
3. In image captioning, the CNN is mainly responsible for:
(A) Text generation
(B) Image classification
(C) Feature extraction from images
(D) Sentence ranking
4. Which neural network component is typically used after the CNN in image captioning?
(A) Decision tree
(B) Recurrent Neural Network
(C) Autoencoder
(D) Transformer Encoder
5. The goal of image captioning is to generate:
(A) Object coordinates
(B) Semantic segmentation
(C) Descriptive sentences
(D) Class labels
6. Which architecture improves performance in image captioning by handling long-range dependencies?
(A) CNN
(B) RNN
(C) LSTM
(D) PCA
7. In attention-based models, the attention mechanism helps the model:
(A) Filter noise
(B) Focus on relevant parts of the image
(C) Perform faster computation
(D) Resize the image
8. Which loss function is commonly used for training image captioning models?
(A) Mean Squared Error
(B) Cross Entropy Loss
(C) Hinge Loss
(D) Dice Loss
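A sketch of the per-word cross-entropy objective from Q8. The vocabulary size and the choice of index 0 as the padding token are assumptions for the demo.

```python
import torch
import torch.nn as nn

vocab_size, batch, seq_len = 1000, 2, 12
# Logits over the vocabulary at every time step, plus the target word indices.
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(1, vocab_size, (batch, seq_len))

# CrossEntropyLoss expects (N, C); flatten the batch and time dimensions.
# ignore_index=0 skips padded positions (assuming 0 is the <pad> index).
criterion = nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```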
9. The BLEU score is used in image captioning to measure:
(A) Image quality
(B) Model complexity
(C) Caption accuracy
(D) Segmentation overlap
10. The attention mechanism in image captioning was introduced in which model?
(A) Show and Tell
(B) Show, Attend and Tell
(C) Deep Caption
(D) Visual Genome
11. Image captioning typically combines which two types of data?
(A) Audio and video
(B) Text and metadata
(C) Visual and textual
(D) Numeric and symbolic
12. Which model architecture enables parallel training in caption generation?
(A) RNN
(B) LSTM
(C) Transformer
(D) GAN
13. Which metric evaluates n-gram overlap in caption generation?
(A) SSIM
(B) IoU
(C) BLEU
(D) PSNR
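BLEU (Q13) rests on clipped n-gram precision between a candidate caption and its references. A hand-rolled unigram/bigram sketch; in practice NLTK's `sentence_bleu` computes the full metric with a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped fraction of candidate n-grams also present in the reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
print(ngram_precision(cand, ref, 1))  # unigram precision
print(ngram_precision(cand, ref, 2))  # bigram precision
```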
14. The CIDEr metric in image captioning emphasizes:
(A) Text fluency
(B) Syntactic accuracy
(C) Consensus among human captions
(D) Image resolution
15. The encoder in an image captioning model processes:
(A) Captions
(B) Feature vectors
(C) Image input
(D) Evaluation metrics
16. The decoder in image captioning is responsible for:
(A) Extracting features
(B) Resizing images
(C) Generating sentences
(D) Compressing data
17. Which optimization algorithm is commonly used in training captioning models?
(A) Gradient Boosting
(B) Adam
(C) K-means
(D) Simulated Annealing
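Adam (Q17) in a typical PyTorch training step; the linear stand-in model, the dummy loss, and the learning rate of 1e-3 are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(512, 1000)                 # stand-in for a captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 512)).pow(2).mean()    # dummy loss for the demo
optimizer.zero_grad()
loss.backward()
optimizer.step()
```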
18. Which of the following is a common challenge in image captioning?
(A) Overfitting to the training data
(B) Poor camera quality
(C) Low pixel density
(D) Absence of RGB values
19. In image captioning, what is “teacher forcing”?
(A) Manually labeling captions
(B) Feeding the ground-truth output to the decoder during training
(C) Using only CNN layers
(D) Encoding data with noise
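One training step with the teacher forcing asked about in Q19: the decoder receives the ground-truth previous word at every position instead of its own prediction. Model sizes and names below are hypothetical.

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 1000, 64, 128
embed = nn.Embedding(vocab_size, embed_size)
rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)

caption = torch.randint(1, vocab_size, (2, 10))  # ground-truth word indices
inputs = caption[:, :-1]   # teacher forcing: feed the true previous words...
targets = caption[:, 1:]   # ...and predict the next word at each step

hidden, _ = rnn(embed(inputs))
loss = nn.functional.cross_entropy(
    head(hidden).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # at inference the model's own predictions are fed back instead
```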
20. What does the term “vocabulary” refer to in image captioning?
(A) Set of image labels
(B) Number of input features
(C) Set of all words used in captions
(D) Collection of model parameters
21. What is beam search used for in caption generation?
(A) Training optimization
(B) Data augmentation
(C) Sequence prediction
(D) Evaluation metric
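A compact beam search sketch for Q21: at every step each partial caption is extended with its top-k next words, and only the k highest-scoring sequences survive. The `step_fn` interface and the toy distribution are assumptions.

```python
import torch

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    """step_fn(seq) -> log-probabilities over the vocabulary for the next word."""
    beams = [([start_id], 0.0)]                 # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:               # finished sequences carry over
                candidates.append((seq, score))
                continue
            top = torch.topk(step_fn(seq), beam_width)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((seq + [int(idx)], score + float(lp)))
        # keep only the k highest-scoring partial captions
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy step function with a fixed distribution, just to exercise the loop.
dummy = torch.log_softmax(torch.randn(50), dim=0)
print(beam_search(lambda seq: dummy, start_id=1, end_id=2))
```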
22. The term “Show and Tell” refers to:
(A) A captioning dataset
(B) A training tool
(C) A deep learning model
(D) A loss function
23. Which layer captures time-dependent patterns in sequence generation?
(A) Dense Layer
(B) Convolution Layer
(C) Recurrent Layer
(D) Normalization Layer
24. What is the main input to the decoder during testing in image captioning?
(A) True label
(B) Previous word prediction
(C) Random noise
(D) Entire image
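At test time (Q24) decoding is autoregressive: each predicted word is fed back in as the next input, since no ground truth is available. A greedy loop with a stand-in scoring function:

```python
import torch

def greedy_decode(next_word_logits, start_id, end_id, max_len=20):
    """Feed each prediction back in as the next input (no ground truth at test time)."""
    seq = [start_id]
    for _ in range(max_len):
        logits = next_word_logits(seq)     # model's score for every word
        word = int(torch.argmax(logits))   # most likely next word
        seq.append(word)
        if word == end_id:
            break
    return seq

# Stand-in model returning random scores, assumed for the demo.
torch.manual_seed(0)
print(greedy_decode(lambda seq: torch.randn(50), start_id=1, end_id=2))
```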
25. Which of the following is a pre-trained model commonly used for feature extraction in image captioning?
(A) VGG16
(B) GPT-2
(C) YOLO
(D) UNet
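Extracting features with a pretrained VGG16 (Q25), assuming a recent torchvision (the weights download on first use). Dropping the final classification layer exposes the 4096-d penultimate feature vector often used as the caption encoder output.

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
# Drop the final classification layer to expose the 4096-d feature vector.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

image = torch.randn(1, 3, 224, 224)   # a preprocessed image batch
with torch.no_grad():
    features = vgg(image)
print(features.shape)  # torch.Size([1, 4096])
```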
26. What does the “context vector” in attention models represent?
(A) Evaluation result
(B) Caption score
(C) Weighted image features
(D) Learning rate
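The context vector from Q26 as a weighted sum: attention scores are softmaxed into weights over image regions, which then average the region features. Dot-product scoring and the 7x7 grid shape are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

regions = torch.randn(1, 49, 512)   # 7x7 grid of region features
hidden = torch.randn(1, 512)        # current decoder state

# Score each region against the decoder state (dot-product attention here).
scores = torch.bmm(regions, hidden.unsqueeze(2)).squeeze(2)    # (1, 49)
weights = F.softmax(scores, dim=1)                             # sum to 1

# Context vector: attention-weighted average of the region features.
context = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)  # (1, 512)
print(context.shape)
```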
27. Why is dropout used in image captioning models?
(A) To reduce model size
(B) To improve image clarity
(C) To prevent overfitting
(D) To increase data throughput
28. Which method improves robustness of captioning models?
(A) Label smoothing
(B) Histogram equalization
(C) Pixel quantization
(D) Dilation
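Label smoothing (Q28) softens the one-hot targets so the model trains to be less over-confident; PyTorch exposes it directly on the loss. The smoothing value of 0.1 is a common but assumed default.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # targets softened toward uniform
logits = torch.randn(4, 1000)                         # (batch, vocab)
targets = torch.randint(0, 1000, (4,))
print(criterion(logits, targets).item())
```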
29. Which of the following is a captioning benchmark dataset?
(A) VOC 2007
(B) Open Images
(C) Flickr8k
(D) LFW
30. What role does a tokenizer play in image captioning?
(A) Enhances image edges
(B) Segments objects
(C) Converts sentences into word indices
(D) Compresses feature vectors
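A minimal word-level tokenizer for Q30. Real pipelines add special tokens, as assumed here with <pad>, <start>, <end>, and <unk>.

```python
class Tokenizer:
    def __init__(self, captions):
        words = sorted({w for c in captions for w in c.lower().split()})
        self.word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
        for w in words:
            self.word2idx[w] = len(self.word2idx)
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def encode(self, sentence):
        """Map a sentence to word indices, bracketed by <start>/<end>."""
        unk = self.word2idx["<unk>"]
        ids = [self.word2idx.get(w, unk) for w in sentence.lower().split()]
        return [self.word2idx["<start>"]] + ids + [self.word2idx["<end>"]]

tok = Tokenizer(["A dog runs", "A cat sleeps"])
print(tok.encode("a dog sleeps"))   # [1, 4, 6, 8, 2]
```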
31. Which deep learning technique allows for generating varied captions for the same image?
(A) Deterministic decoding
(B) Greedy search
(C) Stochastic sampling
(D) Image resizing
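Stochastic sampling (Q31) draws each next word from the predicted distribution rather than taking the argmax, so repeated runs can yield different captions. The temperature knob is an assumed extra.

```python
import torch

def sample_next(logits, temperature=1.0):
    """Draw the next word index from the softmax distribution."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
# Different draws can return different words; greedy argmax never would.
print([sample_next(logits) for _ in range(5)])
```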
32. What is “caption diversity” in image captioning?
(A) Image resolution variance
(B) Number of objects detected
(C) Variety of expressions for the same content
(D) Object detection accuracy
33. Which transformer-based model is adapted for image captioning?
(A) BERT
(B) ResNet
(C) ViLT
(D) Vision Transformer (ViT)
34. What does the term “visual grounding” mean in captioning?
(A) Aligning image regions with textual phrases
(B) Training with GPU
(C) Reducing model size
(D) Labeling background
35. Which of the following can improve caption fluency?
(A) Increased dropout
(B) Sentence embedding
(C) Word repetition
(D) Random word shuffling
36. Which evaluation metric accounts for semantic similarity in captions?
(A) BLEU
(B) CIDEr
(C) METEOR
(D) SSIM
37. Which of the following is not an image captioning evaluation metric?
(A) ROUGE
(B) BLEU
(C) CIDEr
(D) RMSE
38. In a captioning model, which layer is most likely used at the end of the decoder?
(A) Softmax layer
(B) Max-pooling layer
(C) Dropout layer
(D) Convolutional layer
39. Which word usually marks the start of a generated caption sequence?
(A) [CLS]
(B) <START>
(C) <END>
(D) <PAD>
40. What does “end-to-end training” mean in image captioning?
(A) Only training CNN part
(B) Only training RNN part
(C) Training entire model together
(D) Using pretrained decoder
41. Which of these is used for fine-tuning captions after generation?
(A) Caption synthesizer
(B) Post-processing heuristic
(C) Language model reranking
(D) Object detector
42. Which image representation is mainly used in spatial attention?
(A) Image metadata
(B) Entire image as a single vector
(C) Region-specific features
(D) Histogram of intensities
43. Why are hierarchical models used in captioning?
(A) For filtering noise
(B) For faster computation
(C) To model sentence structures
(D) For compressing features
44. In self-critical sequence training (SCST), the reward is computed using:
(A) Decoder weights
(B) CNN loss
(C) Evaluation metric like CIDEr
(D) Batch normalization
45. Which of the following helps a model handle rare words in captions?
(A) Dropout
(B) Beam width
(C) Subword tokenization
(D) Feature normalization
46. Which method avoids repetition in captioning outputs?
(A) Greedy decoding
(B) N-gram blocking
(C) Convolutional pooling
(D) Object tracking
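N-gram blocking (Q46) in its simplest form: during decoding, reject any word that would repeat an n-gram already present in the sequence. A stdlib-only sketch with trigrams.

```python
def violates_ngram_block(seq, next_word, n=3):
    """True if appending next_word would repeat an n-gram already in seq."""
    if len(seq) < n - 1:
        return False
    candidate = tuple(seq[-(n - 1):] + [next_word])
    existing = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return candidate in existing

seq = ["a", "dog", "on", "the", "grass", "on", "the"]
print(violates_ngram_block(seq, "grass"))  # True: "on the grass" repeats
print(violates_ngram_block(seq, "mat"))    # False
```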
47. What is a major limitation of greedy decoding?
(A) High training time
(B) Requires labeled bounding boxes
(C) Misses globally better sequences
(D) Increases vocabulary
48. How is the quality of generated captions usually assessed?
(A) Histogram matching
(B) Human evaluation and automated metrics
(C) Color quantization
(D) Model size
49. Which component in an image captioning model interprets visual data into a fixed-size representation?
(A) Decoder
(B) Tokenizer
(C) Encoder
(D) Softmax Layer
50. Which of the following best describes a key advantage of using Transformers in image captioning?
(A) Faster image rendering
(B) Better spatial resolution
(C) Parallel processing of sequences
(D) Reduced memory usage
