Well, you first start with an image. The image can be very odd, like a photo of a person expressing some extreme (or, alternatively, ambiguous) emotion, or a photo of a person or group of people in a very weird pose, like a dancer caught mid-performance. Or the image can be pretty banal and uninteresting in and of itself, in which case the text has to do the heavy lifting. Bonus points if the image is going viral, or is something from pop culture which everyone knows and/or is talking about.
Now it may seem as if I'm just covering all bases without saying anything, but understand that the type of image affects the attendant text, if you want to make a properly good meme.
So the point of the meme is to use text to relate the image to a different situation than the one portrayed in the image. This is usually done by evoking a familiar, relatable scenario, like "when the car behind beeps immediately after the green light", "when you stub your toe", etc. See examples below:
These were the first-person POV "when you"/"me when" memes. Alternatively, a character or figure in the image can be used to represent something other than "you" or the audience:
The humour comes from one or more of several factors. One is the contrast between the text scenario and the image. Another is the relatability of the situation, which has the audience going "Yeah, this happens to me too" or "This perfectly captures the experience". Conversely, it could be the hyper-specificity of the situation, which has the audience going "who does that even happen to?" or "how do you even think of this?". Yet another could just be how viral and well-known the image is, in which case the humour comes from the recontextualisation of the image.
In the meme I shared, the humour comes from the relatability of the situation; anyone who has had to take care of toddlers knows how they actively go around trying to kill themselves and have to be dragged away from danger. There is also the image-situation contrast I referred to earlier, where the deliberately posing man becomes the representation of a clueless toddler just moments before disaster.