Hello, the AnyDepth algorithm you proposed is very innovative and appealing. Great work!
I am trying to combine the SDT head with the DINOv3-distilled ConvNeXt network, but the four feature maps from ConvNeXt differ from those of a ViT. The four feature maps output by the ViT all share the same spatial resolution, whereas the four ConvNeXt feature maps shrink by a factor of 2 in each spatial dimension from one stage to the next. Because of this, the SDT structure cannot be applied directly. Do the authors have any suggestions for handling this?
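To make the mismatch concrete, here is a small shape-only sketch in plain Python. It assumes a ViT patch size of 16 and the usual ConvNeXt stage strides of 4, 8, 16, and 32; the function names are hypothetical and nothing here comes from the AnyDepth code. It also computes the per-stage resize factors that would bring the ConvNeXt maps onto one common grid, which is only one possible workaround I am considering, not something I know SDT to support.

```python
def vit_feature_sizes(height, width, patch=16, num_layers=4):
    """Spatial size of each tapped ViT layer; all share the patch grid.

    Assumption: patch size 16 and four tapped layers.
    """
    return [(height // patch, width // patch)] * num_layers


def convnext_feature_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of each ConvNeXt stage output.

    Assumption: stage strides 4, 8, 16, 32, so each stage halves the
    resolution of the previous one.
    """
    return [(height // s, width // s) for s in strides]


def resize_factors(sizes, target):
    """Scale factor (rows, cols) needed to bring each map to `target`."""
    return [(target[0] / h, target[1] / w) for h, w in sizes]


vit_sizes = vit_feature_sizes(224, 224)
convnext_sizes = convnext_feature_sizes(224, 224)

# Pick the stride-16 stage as the common grid, since it matches the ViT grid.
factors = resize_factors(convnext_sizes, target=convnext_sizes[2])

print("ViT:     ", vit_sizes)
print("ConvNeXt:", convnext_sizes)
print("Factors: ", factors)
```

With a 224x224 input, the ViT maps are all 14x14, while the ConvNeXt maps are 56x56, 28x28, 14x14, and 7x7. So the first two stages would have to be downsampled and the last one upsampled before a head that expects equal-resolution inputs could consume them.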