Moonshot-v1-32k-vision encounters trouble with text detection

I want to use Moonshot-v1-32k-vision to detect Manchu text columns in a picture.

However, I ran into two issues.

  • I passed response_format={'type': 'json_object'}, but the JSON it returned is incomplete, as shown in the screenshot below:

  • The detection boxes’ positions are offset, as shown in the screenshot. I wonder if the Vision LLM resizes my picture. I’d like to know the width and height of the picture after it is resized.

Can anyone help me? Thank you! :hot_face:

For the first issue, please check the response’s finish_reason.

  • If it’s due to length, you may need to adjust the max_tokens to allow for more output.
  • It could also have been mistakenly interrupted by the repetition detector, triggered by a large amount of repetitive content. We would like to confirm which case it is.
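A minimal sketch of the check described above: inspect the finish_reason of the first choice in the response. The dictionary layout follows the OpenAI-compatible response format that Moonshot's API uses; the example response fragment is hypothetical.

```python
def was_truncated(response: dict) -> bool:
    """Return True if the reply was cut off by the token limit
    (finish_reason == "length"), meaning max_tokens should be raised."""
    return response["choices"][0]["finish_reason"] == "length"


# Hypothetical response fragment illustrating a truncated JSON reply.
resp = {
    "choices": [
        {
            "finish_reason": "length",
            "message": {"content": '{"boxes": [[0.1, 0.2,'},  # cut off mid-JSON
        }
    ]
}

if was_truncated(resp):
    print("Reply truncated: increase max_tokens and retry.")
```

If finish_reason is "stop" instead, the model finished on its own and the incomplete JSON would point to a different cause, such as the repetition detector mentioned above.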

For the second issue: we may reduce the image quality to ease load pressure, but we generally do not reshape it. We therefore recommend asking the model to return bboxes as positions relative to the image.

For example, a bbox of [0.324, 0.315, 0.909, 0.919] can be read as the relative coordinates [x1, y1, x2, y2].

For an image of size 10000 × 20000, the absolute coordinates would be [3240, 6300, 9090, 18380].


Could you please provide the original image and the prompt? We can try to debug and confirm the model’s behavior with them, which might lead to a better practice.