Moonshot-v1-32k-vision encounters trouble with text detection

I want to use Moonshot-v1-32k-vision to detect Manchu text columns in a picture.

However, I ran into two issues.

  • I passed response_format={'type': 'json_object'}, but the JSON it returned is incomplete, as shown in the screenshot below:

  • The detection boxes’ positions are offset, as shown in the screenshot. I wonder if the Vision LLM resizes my picture. I’d like to know the width and height of the picture after it is resized.

Can anyone help me? Thank you! :hot_face:

For the first issue, please check the response’s finish_reason.

  • If it’s due to length, you may need to adjust the max_tokens to allow for more output.
  • It could also have been mistakenly interrupted by the repetition detector, triggered by a large amount of repetitive content. We would like to confirm which case it is.
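A minimal sketch of the check described above: inspect the finish_reason of the first choice in the response. The dictionary layout follows the OpenAI-compatible response format that Moonshot's API uses; the example response fragment is hypothetical.

```python
def was_truncated(response: dict) -> bool:
    """Return True if the reply was cut off by the token limit
    (finish_reason == "length"), meaning max_tokens should be raised."""
    return response["choices"][0]["finish_reason"] == "length"


# Hypothetical response fragment illustrating a truncated JSON reply.
resp = {
    "choices": [
        {
            "finish_reason": "length",
            "message": {"content": '{"boxes": [[0.1, 0.2,'},  # cut off mid-JSON
        }
    ]
}

if was_truncated(resp):
    print("Reply truncated: increase max_tokens and retry.")
```

If finish_reason is "stop" instead, the model finished on its own and the incomplete JSON would point to a different cause, such as the repetition detector mentioned above.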

For the second issue: we may reduce the image quality to ease load pressure, but we generally do not reshape it. We therefore recommend asking the model to return bboxes as positions relative to the image.

For example, a bbox of [0.324, 0.315, 0.909, 0.919] can be read as the relative coordinates [x1, y1, x2, y2].

For an image of size 10000 × 20000, the absolute coordinates would be [3240, 6300, 9090, 18380].


Could you please provide the original image and the prompt? We can try to debug and confirm the model’s behavior with them, which might lead to a better practice.