The document focuses on a specific flaw in the processing of voice commands. Essentially, the microphones used by devices running Alexa and GNow are not perfectly linear: they distort incoming sound slightly, producing new frequencies that weren't present in the original signal. By amplitude-modulating a voice command onto an ultrasonic carrier, an attacker can get the microphone's own non-linearity to "demodulate" the command back down into the audible range as it passes from the microphone to the device's software, even though nothing audible was ever played. Figuring out this behaviour and then exploiting it with ultrasound is how Song and Mittal struck upon their idea.
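To make the idea concrete, here's a minimal sketch of the demodulation effect in NumPy. The 30kHz carrier, the square-law model of microphone distortion, and the 1kHz tone standing in for a real voice command are all illustrative assumptions, not the researchers' actual parameters:

```python
import numpy as np

fs = 192_000                      # sample rate high enough to represent ultrasound
t = np.arange(0, 0.1, 1 / fs)

# Stand-in for a spoken command: a plain 1 kHz tone (assumption for illustration).
command = np.sin(2 * np.pi * 1_000 * t)

# Amplitude-modulate the command onto a 30 kHz carrier -- inaudible to humans.
carrier_hz = 30_000
transmitted = (1 + 0.5 * command) * np.cos(2 * np.pi * carrier_hz * t)

# Microphone hardware isn't perfectly linear; a simple model adds a square-law
# term. Squaring an AM signal shifts a copy of the command down to audible
# frequencies, so the demodulation happens in the hardware itself.
received = transmitted + 0.1 * transmitted ** 2

# The device's audio pipeline discards ultrasound before speech recognition;
# model that as a low-pass filter that zeroes everything above 20 kHz.
spectrum = np.fft.rfft(received)
freqs = np.fft.rfftfreq(len(received), 1 / fs)
spectrum[freqs > 20_000] = 0
audible = np.fft.irfft(spectrum)

# The strongest non-DC component of what the recogniser "hears" is the
# original 1 kHz command.
mags = np.abs(np.fft.rfft(audible))
peak_hz = freqs[1:][np.argmax(mags[1:])]
print(peak_hz)
```

The key line is the square-law term: squaring `(1 + 0.5*command) * cos(carrier)` produces, among other components, a low-frequency copy of `command` that survives the device's filtering while the ultrasonic carrier does not.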
While previous research endeavours have worked with audible sound (such as this example), this appears to be the first known research with inaudible sound. In fact, the trial has been so successful that their inaudible voice commands worked 100% of the time with a Nexus 5X, 3m away, and 80% of the time with an Echo, 2m away.
Here's a demo of the research.
The opportunities for intercepting and hijacking such devices are considerable, although it will require some effort on the attacker's part to construct an ultrasonic transmitter capable of issuing commands of this nature. They may, of course, also need a recording of the device owner's voice (if the owner's GNow or Alexa-based device has been set up for such customisation) and a good-quality speaker. None of this is impossible, however. We're keen to hear of any efforts where such attacks have taken place; it's only a matter of time.
The full report is here.