Web Speech API

Sunday, April 14, 2013

TalkShow Story Illustrator: pictures of the things you're talking about

I had a lot of fun writing this app, but I still don't know exactly what it is. Art project, toy, presentation tool, a new kind of instrument you play with your voice?

Now that Google has released Conversational Search, you might call this Passive Conversational Image Search.

TalkShow

TalkShow shows pictures of the things you're talking about.

User's Guide

Click the microphone button
Speak slowly and clearly
Talk about people, places, and things
Marvel at the slideshow of occasionally relevant images

The program doesn't use voice commands. Turn it on, and it listens to your conversation, constantly analyzing your speech, looking for any mention of a person or place. It searches a lot, displays little.

Start by saying some famous names: TV shows, bands, cities, monuments, celebrities, historical figures, brands, products. Anything with an image that might be found on the internet. Pause between names.

If it doesn't recognize a name, try using the name in a complete sentence. Entity recognition works better on narrative text than on lists of isolated names.

Try telling a story that mentions some well-known people, places, or things.

Talk about your last vacation or the time you met a celebrity.

Turn it on and let it eavesdrop while you have a conversation.

What it does

TalkShow transcribes your speech, mines the text for proper names, does a series of internet image searches, and displays the search results as a slideshow.

When you pause for a second, the words transcribed since the last pause are examined. If this chunk contains capitalized words, a search for 20 hits is started for each contiguous run of capitalized words. If there are none, the entire phrase is searched for three hits.
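
The pause rule can be sketched in a few lines of JavaScript. This is only an illustration, not TalkShow's actual code: the function name planQueries and the shape of its return value are assumptions, and it treats any word that starts with an ASCII uppercase letter as part of a name, so a capitalized first word of a sentence would count too.

```javascript
// Sketch: decide which image searches to run for one pause-delimited chunk.
// planQueries and the {query, hits} objects are hypothetical names.
function planQueries(chunk) {
  var words = chunk.trim().split(/\s+/).filter(Boolean);
  var runs = [];
  var current = [];

  words.forEach(function (word) {
    if (/^[A-Z]/.test(word)) {
      // Capitalized word: extend the current run.
      current.push(word);
    } else if (current.length) {
      // Lowercase word ends the run.
      runs.push(current.join(' '));
      current = [];
    }
  });
  if (current.length) runs.push(current.join(' '));

  if (runs.length) {
    // One 20-hit search per contiguous run of capitalized words.
    return runs.map(function (name) {
      return { query: name, hits: 20 };
    });
  }
  // No capitalized words: search the whole phrase for three hits.
  return words.length ? [{ query: words.join(' '), hits: 3 }] : [];
}
```

For example, planQueries("we flew to New York and saw Madonna") plans one search for "New York" and one for "Madonna".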

Between pauses, the program repeatedly checks the words that have arrived since the last pause, looking for new names. If it finds one, it interrupts the slideshow to display the new image as soon as possible.
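
Here is a rough sketch of that between-pauses check, assuming the interim transcript arrives as a plain string. createNameWatcher is a hypothetical helper, not part of TalkShow or the Web Speech API.

```javascript
// Sketch: report capitalized runs that have not been seen since the last pause.
function createNameWatcher() {
  var seen = Object.create(null);
  return {
    // Returns the names in interimText that were not reported before.
    check: function (interimText) {
      var matches = interimText.match(/[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g) || [];
      return matches.filter(function (name) {
        if (seen[name]) return false;
        seen[name] = true;
        return true;
      });
    },
    // Call this at each pause to start a fresh chunk.
    reset: function () {
      seen = Object.create(null);
    }
  };
}
```

A caller would run check() on every interim transcript and interrupt the slideshow whenever it returns a non-empty list. Note that a partial name such as "New" can be reported before "New York" arrives.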

The slideshow cycles through the results for all queries, looking for images that have not been displayed yet, paying attention to the rank of an image in the search results list. Top hits for each query are displayed first.
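
One way to get this ordering is to walk the result lists rank by rank. The sketch below assumes the results are kept as an object that maps each query to its image URLs in rank order; nextImage and that data layout are illustrative assumptions, not the app's real structures.

```javascript
// Sketch: pick the next image to show.
// results: { "query text": [imageUrl, ...] } with the best hit first.
// shown:   a Set of URLs that have already been displayed.
function nextImage(results, shown) {
  var queries = Object.keys(results);
  var lengths = queries.map(function (q) { return results[q].length; });
  var maxLen = Math.max.apply(null, lengths.concat(0));

  // Visit rank 0 of every query, then rank 1, and so on,
  // so the top hits for each query come up first.
  for (var rank = 0; rank < maxLen; rank++) {
    for (var i = 0; i < queries.length; i++) {
      var url = results[queries[i]][rank];
      if (url !== undefined && !shown.has(url)) {
        shown.add(url);
        return url;
      }
    }
  }
  return null; // Everything has been displayed.
}
```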

How it works

This is instant image search using continuous narrative speech transcription and named entity recognition.

Speech Recognition using Web Speech API by Google Chrome Version 25

Named Entity Recognition by Alchemy API

Image Search by Yahoo BOSS and flickr

Slideshow using jQuery

Speech Recognition:

Voice Driven Web Apps: Introduction to the Web Speech API - HTML5Rocks Updates
http://updates.html5rocks.com/2013/01/Voice-Driven-Web-Apps-Introduction-to-the-Web-Speech-API

Named Entity Recognition:

Alchemy API
http://www.alchemyapi.com/

The TalkShow JavaScript code decides what to search partly based on the results of Alchemy's Named Entity Recognition. Alchemy is consulted only on completed phrases. For interim results, the simpler approach of looking for sequences of capitalized words in the speech transcript is used, which allows faster display of images for a named entity. At the moment, the speech recognition software capitalizes names only in English.

Alchemy works for English, French, German, Italian, Russian, Spanish, and Swedish. For other languages, we look up every pause-delimited chunk. You control the program by your placement of pauses.
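
The choice of strategy could be written as a small lookup. This is a sketch under assumptions: the strategy names are made up for illustration, the two-letter codes are the ISO 639-1 codes for the languages listed above, and doing nothing with interim results in non-English languages is my guess, not something the app is known to do.

```javascript
// Languages listed above as supported by Alchemy (ISO 639-1 codes).
var ALCHEMY_LANGUAGES = ['en', 'fr', 'de', 'it', 'ru', 'es', 'sv'];

// Sketch: choose how to mine a transcript for search terms.
// langTag is a tag such as "en-US"; isFinal is true for a completed phrase.
function searchStrategy(langTag, isFinal) {
  var primary = langTag.toLowerCase().split('-')[0];
  if (!isFinal) {
    // Interim results: the capitalization trick is only available in English.
    return primary === 'en' ? 'capitalized-words' : 'wait-for-pause';
  }
  // Completed phrases: use entity recognition where the language allows it.
  return ALCHEMY_LANGUAGES.indexOf(primary) >= 0 ? 'alchemy-entities' : 'whole-chunk';
}
```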

Image Search:

Yahoo BOSS
http://developer.yahoo.com/boss/search/

flickr
http://www.flickr.com/

Interpreter: hands-free continuous voice-to-voice translation on the web

Speech recognition apps are easy to use but take a lot of attention. You click a button to start talking, then click it again when you're done. To hear the translation, you click another button. That's three clicks per sentence, repeated over and over.

The new Web Speech API (currently supported only by Google Chrome version 25 and above) does continuous speech recognition. You click a start button, and then talk as long as you like. It listens and transcribes as you speak. This makes new kinds of speech-driven interfaces possible.
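
A minimal sketch of continuous recognition looks something like this. The properties continuous, interimResults, lang and onresult and the start() method come from the Web Speech API; startListening and its two callbacks are hypothetical names. The constructor is passed in as a parameter so the sketch does not depend on a browser global; in Chrome you would pass window.webkitSpeechRecognition.

```javascript
// Sketch: start continuous recognition and route final and interim text.
function startListening(Recognition, lang, onFinal, onInterim) {
  var recognition = new Recognition();
  recognition.continuous = true;     // Keep listening across pauses.
  recognition.interimResults = true; // Deliver partial transcripts too.
  recognition.lang = lang;

  recognition.onresult = function (event) {
    var interim = '';
    // Only results from resultIndex onward have changed in this event.
    for (var i = event.resultIndex; i < event.results.length; i++) {
      var text = event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        onFinal(text);
      } else {
        interim += text;
      }
    }
    if (interim) onInterim(interim);
  };

  recognition.start();
  return recognition;
}
```

In a page this might be called as startListening(window.webkitSpeechRecognition, 'en-US', handleFinal, handleInterim), where the two handlers are whatever your app does with the text.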

Interpreter

Interpreter is a hands-free translator made from:

Web Speech API for Speech Recognition by Google Chrome Version 25
Translation by Bing Translator
Text-to-Speech by Google tts

The app listens for gaps in your speech of a second or more. The text returned by speech recognition between pauses is submitted for translation. When the translated text comes back, the app shuts off the microphone to avoid feedback while it plays the translated text chunk. When the playback is done, it turns on the microphone again to capture the next chunk of speech. The app also back-translates to the source language. Back-translation will magnify errors in the original translation, but can be useful if you don't speak the target language. Click on the title bar to tell the app to read either the translation or the back-translation.
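
The listen, translate, speak cycle can be sketched as follows. The translation, playback and microphone functions are passed in as callbacks because their real implementations are not shown here; createInterpreter and the names in deps are all hypothetical.

```javascript
// Sketch of the Interpreter cycle.
// deps.translate(text, cb) calls cb(translated, backTranslated).
// deps.speak(text, cb) calls cb() when playback has finished.
// deps.micOff() and deps.micOn() stop and restart speech recognition.
function createInterpreter(deps) {
  var readBackTranslation = false;
  return {
    // Title bar click: switch between reading the translation
    // and reading the back-translation.
    toggleReadBack: function () {
      readBackTranslation = !readBackTranslation;
      return readBackTranslation;
    },
    // Called with the text recognized between two pauses.
    onChunk: function (text) {
      deps.translate(text, function (translated, backTranslated) {
        deps.micOff(); // Avoid feedback while the speech plays.
        var toSpeak = readBackTranslation ? backTranslated : translated;
        deps.speak(toSpeak, function () {
          deps.micOn(); // Ready for the next chunk.
        });
      });
    }
  };
}
```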

Saturday, April 13, 2013

How to use the latest Microsoft Azure translator API with access tokens, in PHP and jQuery

Unless you're working with Microsoft tools, it's hard to find sample JavaScript code for the Azure translator API that actually works. There's a lot of obsolete code written for the version that used appId, deprecated since 2012.

Microsoft Translator API has gone through at least 3 major revisions. This doc is about the latest version as of Dec 2016. From Microsoft:

Beginning January 1, 2017, subscribers to the Microsoft Translator Text and Speech APIs will have limited access in Azure DataMarket. We recommend that you move to Azure now to avoid service disruption.

The good news is that only a minor change is needed to the small PHP script that requests tokens.

Wang Pidong's very helpful post shows the bare bones of how to use the new access token methods from the Unix command line.

Here's how it works with PHP and jQuery. Online demo of the code listed below:

http://www.johndimm.com/FunWithSpeech/BingTranslator/

To get started, you need to host a server-side script whose only job is to get access tokens from Windows Azure. Each token is valid for 10 minutes. It will be stored in the browser and used for direct access between the user's machine and Azure's servers, avoiding the need to proxy all requests through your server. This method also lets you avoid exposing your subscription key in the client-side javascript.

Here's a simple PHP script, named token.php, to get the token. You will need to supply your own subscription key at the top of the file.

<?php

// Get a 10-minute access token for Microsoft Translator API.

$url = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken';
$sub = "your Azure subscription key for text translation";
// Placeholder POST body. The subscription key itself is sent as a header below.
$data = "{body}";

$ch = curl_init();

$data_string = json_encode($data);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);

curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Content-Type: application/json',
'Content-Length: ' . strlen($data_string),
'Ocp-Apim-Subscription-Key: ' . $sub
)
);

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

$token = curl_exec($ch);

// Return the token inside some json, as expected by index.html.
// json_encode keeps the output valid json even if curl returns an error message.
print json_encode(array('access_token' => $token));
?>

Here is the url to that script running on my server: token.php

To check your setup, make sure token.php returns something similar to this:

{"access_token":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzY29wZSI6Imh0dHBzOi8vYXBpLm1pY3Jvc29mdHRyYW5zbGF0b3IuY29tLyIsInN1YnNjcmlwdGlvbi1pZCI6IjZlMmI3ODM5MmJjMDQ1NGFhNzVlNDRhYzY0ZjFlZjc1IiwicHJvZHVjdC1pZCI6IlRleHRUcmFuc2xhdG9yLkYwIiwiY29nbml0aXZlLXNlcnZpY2VzLWVuZHBvaW50IjoiaHR0cHM6Ly9hcGkuY29nbml0aXZlLm1pY3Jvc29mdC5jb20vaW50ZXJuYWwvdjEuMC8iLCJhenVyZS1yZXNvdXJjZS1pZCI6Ii9zdWJzY3JpcHRpb25zL2U5MWM3NjgwLTk2ZWItNDAzYi05ZmZiLTNlZmQ1MjgyOTRiOS9yZXNvdXJjZUdyb3Vwcy9hX3Jlc19ncm91cC9wcm92aWRlcnMvTWljcm9zb2Z0LkNvZ25pdGl2ZVNlcnZpY2VzL2FjY291bnRzL3RyYW5zbGF0ZV90ZXh0IiwiaXNzIjoidXJuOm1zLmNvZ25pdGl2ZXNlcnZpY2VzIiwiYXVkIjoidXJuOm1zLm1pY3Jvc29mdHRyYW5zbGF0b3IiLCJleHAiOjE0ODI4MjM1NDh9.rVfSPUd2yBF0j64L-SrBZC7nIHOsKCIrHOiRs0YpCmE"}
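
The sample above looks like a JWT: three base64url segments separated by dots, with an exp claim in the middle segment. Assuming that holds for your tokens too, a small helper can read the expiry time, which is handy when checking the 10-minute lifetime. tokenExpiry is a hypothetical name, and the sketch does not verify the signature.

```javascript
// Sketch: read the expiry time from a JWT-style access token.
// Returns a Date, or null if the token does not look like a JWT with exp.
function tokenExpiry(token) {
  var payload = String(token).split('.')[1];
  if (!payload) return null;
  try {
    // Convert base64url to plain base64 before decoding.
    var b64 = payload.replace(/-/g, '+').replace(/_/g, '/');
    var json = typeof atob === 'function'
      ? atob(b64)                                        // Browsers.
      : Buffer.from(b64, 'base64').toString('binary');   // Older Node.js.
    var exp = JSON.parse(json).exp; // Seconds since the Unix epoch.
    return typeof exp === 'number' ? new Date(exp * 1000) : null;
  } catch (e) {
    return null;
  }
}
```

Comparing tokenExpiry(data.access_token) with the current time shows how long the token still has to live.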

Now that you can get a token, it's time to use it to do a translation. Here's an html page that displays a two-box interface that translates English to French. The javascript accesses two server scripts:

token.php: the script listed above. This needs to be installed on the same server as the html page below. The html page accesses the token.php script using json. The token is refreshed every 9 minutes.
Translate at api.microsofttranslator.com: Since your script cannot reside on Microsoft's server, we have to use jsonp to allow cross-site access. This script is accessed each time a translation is requested.

<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>

<script language="javascript">
var g_token = '';

function onLoad() {
// Get an access token now. Good for 10 minutes.
getToken();
// Get a new one every 9 minutes.
setInterval(getToken, 9 * 60 * 1000);
}

function getToken() {
var requestStr = "./token.php";

$.ajax({
url: requestStr,
type: "GET",
cache: true,
dataType: 'json',
success: function (data) {
g_token = data.access_token;
}
});
}

function translate(text, from, to) {
var p = {};
p.text = text;
p.from = from;
p.to = to;

// A major puzzle solved. Who would have guessed you specify the jsonp callback in oncomplete?
p.oncomplete = 'ajaxTranslateCallback';

// Another major puzzle. The authentication is supplied in the deprecated appId param.
p.appId = "Bearer " + g_token;

var requestStr = "http://api.microsofttranslator.com/V2/Ajax.svc/Translate";

$.ajax({
url: requestStr,
type: "GET",
data: p,
dataType: 'jsonp',
cache: true
});
}

function ajaxTranslateCallback(response) {
// Display translated text in the right textarea.
// Use val(), not text(), so the textarea also updates after the user has edited it.
$("#target").val(response);
}

function translateSourceTarget() {
// Translate the text typed by the user into the left textarea.
var src = $("#source").val();
translate(src, "en", "fr");
}
</script>
<style>
#source, #target {
position:relative;
float:left;
width:400px;
height: 50px;
padding:10px;
margin: 10px;
border: 1px solid black;
}

#translateButton {
float:left;
margin: 10px;
height:50px;
}
</style>
</head>

<body onload="onLoad();">

<textarea id="source">Text typed here will be translated.</textarea>
<button id="translateButton" onclick="translateSourceTarget();">Translate English to French</button>
<textarea id="target"></textarea>

</body>
</html>
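
To see what the page actually sends, it can help to build the request URL by hand. The sketch below assembles roughly the same query string that jQuery generates from the p object in translate(); buildTranslateUrl is a hypothetical helper, and the real request also carries a callback parameter that jQuery adds on its own.

```javascript
// Sketch: build the Translate request URL from the same parameters
// the page passes to $.ajax.
function buildTranslateUrl(text, from, to, token) {
  var params = {
    text: text,
    from: from,
    to: to,
    oncomplete: 'ajaxTranslateCallback', // Name of the jsonp callback.
    appId: 'Bearer ' + token             // Access token goes in appId.
  };
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return 'http://api.microsofttranslator.com/V2/Ajax.svc/Translate?' + query;
}
```

Pasting the result into the browser's address bar with a fresh token is a quick way to check the token separately from the page code.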