1. Which deep learning model is most commonly used for generating image captions?
(A) Convolutional Neural Network
(B) Recurrent Neural Network
(C) CNN-RNN hybrid model
(D) Support Vector Machine
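A minimal sketch of the CNN-RNN hybrid named in Q1, written in PyTorch. The tiny convolutional encoder, the `embed_size`/`hidden_size` values, and feeding the image feature as the first decoder input are illustrative assumptions, not a reference implementation; production models swap in a pretrained backbone such as ResNet.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """CNN encoder extracts image features; RNN decoder generates words."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=512):
        super().__init__()
        # Toy CNN encoder; in practice a pretrained backbone is used here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_size),
        )
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)    # (B, 1, embed)
        words = self.embed(captions)                 # (B, T, embed)
        inputs = torch.cat([feats, words], dim=1)    # image feature first
        hidden, _ = self.decoder(inputs)
        return self.fc(hidden)                       # (B, T+1, vocab)

model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 1000])
```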
2. Which dataset is widely used for training image captioning models?
(A) ImageNet
(B) COCO
(C) CIFAR-10
(D) PASCAL VOC
3. In image captioning, the CNN is mainly responsible for:
(A) Text generation
(B) Image classification
(C) Feature extraction from images
(D) Sentence ranking
4. Which neural network component is typically used after the CNN in image captioning?
(A) Decision tree
(B) Recurrent Neural Network
(C) Autoencoder
(D) Transformer Encoder
5. The goal of image captioning is to generate:
(A) Object coordinates
(B) Semantic segmentation
(C) Descriptive sentences
(D) Class labels
6. Which architecture improves performance in image captioning by handling long-range dependencies?
(A) CNN
(B) RNN
(C) LSTM
(D) PCA
7. In attention-based models, the attention mechanism helps the model:
(A) Filter noise
(B) Focus on relevant parts of the image
(C) Perform faster computation
(D) Resize the image
8. Which loss function is commonly used for training image captioning models?
(A) Mean Squared Error
(B) Cross Entropy Loss
(C) Hinge Loss
(D) Dice Loss
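A sketch of the per-word cross-entropy objective from Q8. The vocabulary size and the choice of index 0 as the padding token are assumptions for the demo.

```python
import torch
import torch.nn as nn

vocab_size, batch, seq_len = 1000, 2, 12
# Logits over the vocabulary at every time step, plus the target word indices.
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(1, vocab_size, (batch, seq_len))

# CrossEntropyLoss expects (N, C); flatten the batch and time dimensions.
# ignore_index=0 skips padded positions (assuming 0 is the <pad> index).
criterion = nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```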
9. The BLEU score is used in image captioning to measure:
(A) Image quality
(B) Model complexity
(C) Caption accuracy
(D) Segmentation overlap
10. The attention mechanism in image captioning was introduced in which model?
(A) Show and Tell
(B) Show, Attend and Tell
(C) Deep Caption
(D) Visual Genome
11. Image captioning typically combines which two types of data?
(A) Audio and video
(B) Text and metadata
(C) Visual and textual
(D) Numeric and symbolic
12. Which model architecture enables parallel training in caption generation?
(A) RNN
(B) LSTM
(C) Transformer
(D) GAN
13. Which metric evaluates n-gram overlap in caption generation?
(A) SSIM
(B) IoU
(C) BLEU
(D) PSNR
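BLEU (Q13) rests on clipped n-gram precision between a candidate caption and its references. A hand-rolled unigram/bigram sketch; in practice NLTK's `sentence_bleu` computes the full metric with a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped fraction of candidate n-grams also present in the reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
print(ngram_precision(cand, ref, 1))  # unigram precision
print(ngram_precision(cand, ref, 2))  # bigram precision
```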
14. The CIDEr metric in image captioning emphasizes:
(A) Text fluency
(B) Syntactic accuracy
(C) Consensus among human captions
(D) Image resolution
15. The encoder in an image captioning model processes:
(A) Captions
(B) Feature vectors
(C) Image input
(D) Evaluation metrics
16. The decoder in image captioning is responsible for:
(A) Extracting features
(B) Resizing images
(C) Generating sentences
(D) Compressing data
17. Which optimization algorithm is commonly used in training captioning models?
(A) Gradient Boosting
(B) Adam
(C) K-means
(D) Simulated Annealing
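Adam (Q17) in a typical PyTorch training step; the linear stand-in model, the dummy loss, and the learning rate of 1e-3 are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(512, 1000)                 # stand-in for a captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 512)).pow(2).mean()    # dummy loss for the demo
optimizer.zero_grad()
loss.backward()
optimizer.step()
```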
18. Which of the following is a common challenge in image captioning?
(A) Overfitting to the training data
(B) Poor camera quality
(C) Low pixel density
(D) Absence of RGB values
19. In image captioning, what is “teacher forcing”?
(A) Manually labeling captions
(B) Feeding the ground-truth output to the decoder during training
(C) Using only CNN layers
(D) Encoding data with noise
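One training step with the teacher forcing asked about in Q19: the decoder receives the ground-truth previous word at every position instead of its own prediction. Model sizes and names below are hypothetical.

```python
import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 1000, 64, 128
embed = nn.Embedding(vocab_size, embed_size)
rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)

caption = torch.randint(1, vocab_size, (2, 10))  # ground-truth word indices
inputs = caption[:, :-1]   # teacher forcing: feed the true previous words...
targets = caption[:, 1:]   # ...and predict the next word at each step

hidden, _ = rnn(embed(inputs))
loss = nn.functional.cross_entropy(
    head(hidden).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # at inference the model's own predictions are fed back instead
```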
20. What does the term “vocabulary” refer to in image captioning?
(A) Set of image labels
(B) Number of input features
(C) Set of all words used in captions
(D) Collection of model parameters
21. What is beam search used for in caption generation?
(A) Training optimization
(B) Data augmentation
(C) Sequence prediction
(D) Evaluation metric
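A compact beam search sketch for Q21: at every step each partial caption is extended with its top-k next words, and only the k highest-scoring sequences survive. The `step_fn` interface and the toy distribution are assumptions.

```python
import torch

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    """step_fn(seq) -> log-probabilities over the vocabulary for the next word."""
    beams = [([start_id], 0.0)]                 # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:               # finished sequences carry over
                candidates.append((seq, score))
                continue
            top = torch.topk(step_fn(seq), beam_width)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((seq + [int(idx)], score + float(lp)))
        # keep only the k highest-scoring partial captions
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy step function with a fixed distribution, just to exercise the loop.
dummy = torch.log_softmax(torch.randn(50), dim=0)
print(beam_search(lambda seq: dummy, start_id=1, end_id=2))
```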
22. The term “Show and Tell” refers to:
(A) A captioning dataset
(B) A training tool
(C) A deep learning model
(D) A loss function
23. Which layer captures time-dependent patterns in sequence generation?
(A) Dense Layer
(B) Convolution Layer
(C) Recurrent Layer
(D) Normalization Layer
24. What is the main input to the decoder during testing in image captioning?
(A) True label
(B) Previous word prediction
(C) Random noise
(D) Entire image
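At test time (Q24) decoding is autoregressive: each predicted word is fed back in as the next input, since no ground truth is available. A greedy loop with a stand-in scoring function:

```python
import torch

def greedy_decode(next_word_logits, start_id, end_id, max_len=20):
    """Feed each prediction back in as the next input (no ground truth at test time)."""
    seq = [start_id]
    for _ in range(max_len):
        logits = next_word_logits(seq)     # model's score for every word
        word = int(torch.argmax(logits))   # most likely next word
        seq.append(word)
        if word == end_id:
            break
    return seq

# Stand-in model returning random scores, assumed for the demo.
torch.manual_seed(0)
print(greedy_decode(lambda seq: torch.randn(50), start_id=1, end_id=2))
```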
25. Which of the following is a pre-trained model commonly used for feature extraction in image captioning?
(A) VGG16
(B) GPT-2
(C) YOLO
(D) UNet
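Extracting features with a pretrained VGG16 (Q25), assuming a recent torchvision (the weights download on first use). Dropping the final classification layer exposes the 4096-d penultimate feature vector often used as the caption encoder output.

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
# Drop the final classification layer to expose the 4096-d feature vector.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

image = torch.randn(1, 3, 224, 224)   # a preprocessed image batch
with torch.no_grad():
    features = vgg(image)
print(features.shape)  # torch.Size([1, 4096])
```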
26. What does the “context vector” in attention models represent?
(A) Evaluation result
(B) Caption score
(C) Weighted image features
(D) Learning rate
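The context vector from Q26 as a weighted sum: attention scores are softmaxed into weights over image regions, which then average the region features. Dot-product scoring and the 7x7 grid shape are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

regions = torch.randn(1, 49, 512)   # 7x7 grid of region features
hidden = torch.randn(1, 512)        # current decoder state

# Score each region against the decoder state (dot-product attention here).
scores = torch.bmm(regions, hidden.unsqueeze(2)).squeeze(2)    # (1, 49)
weights = F.softmax(scores, dim=1)                             # sum to 1

# Context vector: attention-weighted average of the region features.
context = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)  # (1, 512)
print(context.shape)
```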
27. Why is dropout used in image captioning models?
(A) To reduce model size
(B) To improve image clarity
(C) To prevent overfitting
(D) To increase data throughput
28. Which method improves robustness of captioning models?
(A) Label smoothing
(B) Histogram equalization
(C) Pixel quantization
(D) Dilation
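Label smoothing (Q28) softens the one-hot targets so the model trains to be less over-confident; PyTorch exposes it directly on the loss. The smoothing value of 0.1 is a common but assumed default.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # targets softened toward uniform
logits = torch.randn(4, 1000)                         # (batch, vocab)
targets = torch.randint(0, 1000, (4,))
print(criterion(logits, targets).item())
```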
29. Which of the following is a captioning benchmark dataset?
(A) VOC 2007
(B) Open Images
(C) Flickr8k
(D) LFW
30. What role does a tokenizer play in image captioning?
(A) Enhances image edges
(B) Segments objects
(C) Converts sentences into word indices
(D) Compresses feature vectors
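A minimal word-level tokenizer for Q30. Real pipelines add special tokens, as assumed here with <pad>, <start>, <end>, and <unk>.

```python
class Tokenizer:
    def __init__(self, captions):
        words = sorted({w for c in captions for w in c.lower().split()})
        self.word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
        for w in words:
            self.word2idx[w] = len(self.word2idx)
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def encode(self, sentence):
        """Map a sentence to word indices, bracketed by <start>/<end>."""
        unk = self.word2idx["<unk>"]
        ids = [self.word2idx.get(w, unk) for w in sentence.lower().split()]
        return [self.word2idx["<start>"]] + ids + [self.word2idx["<end>"]]

tok = Tokenizer(["A dog runs", "A cat sleeps"])
print(tok.encode("a dog sleeps"))   # [1, 4, 6, 8, 2]
```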
31. Which deep learning technique allows for generating varied captions for the same image?
(A) Deterministic decoding
(B) Greedy search
(C) Stochastic sampling
(D) Image resizing
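Stochastic sampling (Q31) draws each next word from the predicted distribution rather than taking the argmax, so repeated runs can yield different captions. The temperature knob is an assumed extra.

```python
import torch

def sample_next(logits, temperature=1.0):
    """Draw the next word index from the softmax distribution."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
# Different draws can return different words; greedy argmax never would.
print([sample_next(logits) for _ in range(5)])
```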
32. What is “caption diversity” in image captioning?
(A) Image resolution variance
(B) Number of objects detected
(C) Variety of expressions for the same content
(D) Object detection accuracy
33. Which transformer-based model is adapted for image captioning?
(A) BERT
(B) ResNet
(C) ViLT
(D) Vision Transformer (ViT)
34. What does the term “visual grounding” mean in captioning?
(A) Aligning image regions with textual phrases
(B) Training with GPU
(C) Reducing model size
(D) Labeling background
35. Which of the following can improve caption fluency?
(A) Increased dropout
(B) Sentence embedding
(C) Word repetition
(D) Random word shuffling
36. Which evaluation metric accounts for semantic similarity in captions?
(A) BLEU
(B) CIDEr
(C) METEOR
(D) SSIM
37. Which of the following is not an image captioning evaluation metric?
(A) ROUGE
(B) BLEU
(C) CIDEr
(D) RMSE
38. In a captioning model, which layer is most likely used at the end of the decoder?
(A) Softmax layer
(B) Max-pooling layer
(C) Dropout layer
(D) Convolutional layer
39. Which word usually marks the start of a generated caption sequence?
(A) [CLS]
(B) <START>
(C) <END>
(D) <PAD>
40. What does “end-to-end training” mean in image captioning?
(A) Only training CNN part
(B) Only training RNN part
(C) Training entire model together
(D) Using pretrained decoder
41. Which of these is used for fine-tuning captions after generation?
(A) Caption synthesizer
(B) Post-processing heuristic
(C) Language model reranking
(D) Object detector
42. Which image representation is mainly used in spatial attention?
(A) Image metadata
(B) Entire image as a single vector
(C) Region-specific features
(D) Histogram of intensities
43. Why are hierarchical models used in captioning?
(A) For filtering noise
(B) For faster computation
(C) To model sentence structures
(D) For compressing features
44. In self-critical sequence training (SCST), the reward is computed using:
(A) Decoder weights
(B) CNN loss
(C) Evaluation metric like CIDEr
(D) Batch normalization
45. Which of the following helps a model handle rare words in captions?
(A) Dropout
(B) Beam width
(C) Subword tokenization
(D) Feature normalization
46. Which method avoids repetition in captioning outputs?
(A) Greedy decoding
(B) N-gram blocking
(C) Convolutional pooling
(D) Object tracking
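N-gram blocking (Q46) in its simplest form: during decoding, reject any word that would repeat an n-gram already present in the sequence. A stdlib-only sketch with trigrams.

```python
def violates_ngram_block(seq, next_word, n=3):
    """True if appending next_word would repeat an n-gram already in seq."""
    if len(seq) < n - 1:
        return False
    candidate = tuple(seq[-(n - 1):] + [next_word])
    existing = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
    return candidate in existing

seq = ["a", "dog", "on", "the", "grass", "on", "the"]
print(violates_ngram_block(seq, "grass"))  # True: "on the grass" repeats
print(violates_ngram_block(seq, "mat"))    # False
```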
47. What is a major limitation of greedy decoding?
(A) High training time
(B) Requires labeled bounding boxes
(C) Misses globally better sequences
(D) Increases vocabulary
48. How is the quality of generated captions usually assessed?
(A) Histogram matching
(B) Human evaluation and automated metrics
(C) Color quantization
(D) Model size
49. Which component in an image captioning model interprets visual data into a fixed-size representation?
(A) Decoder
(B) Tokenizer
(C) Encoder
(D) Softmax Layer
50. Which of the following best describes a key advantage of using Transformers in image captioning?
(A) Faster image rendering
(B) Better spatial resolution
(C) Parallel processing of sequences
(D) Reduced memory usage
