WHITE PAPERS Localization

This document outlines requirements for a localization system that can serve the user interface needs of multiple products built with different UI technologies.

1. General Requirements for a Localization System

A localization system should satisfy the following requirements for all technologies.

1.1. Localization Features

A localization system should support the following basic features...

1.1.1. String Selection by Key

The localization system should be able to select an appropriate string by key, from a set of resource strings, that is translated for one of the supported locales.

To support the possibility of all strings for all UI components for all locales being present within the same database (possibly as a side effect of certain translation workflow processes), the key used to identify a string may contain the following parts...

	locale
	product / component path
	version
	string key
	variant identifier

1.1.2. String Substitution

The localization system should support substitution of values into an arbitrary number of tokens inside resource strings.

Furthermore, the substitution system should support values being of specific types which can inform formatting of their values and provide cues to a string variant sub-selection mechanism that can select available variants that best match the qualities of the substitution values.

1.1.3. String Variant Sub-selection

In cases where genders and/or quantities need to affect the translation of a resource string for a specific language, the localization system needs to support the sub-selection of the best string variant.

Since string variants may not be necessary (and may not exist) for all languages, the string variant sub-selection mechanism should support a system for fallbacks, such that a more generic variant is selected unless a more specific variant is defined. To allow the desired priority to be controlled, a simple algorithm can be employed that iterates through all the defined variants and selects the last matching variant, leaving it up to the author of the translations to order the variants from most general to most specific.

1.1.3.1. An Example

To illustrate this with an example, say that a string has two substitutions, one for a person's name and the other for an object, where the person's name will have an associated gender and the object will have an associated quantity, both of which will be provided as translation cues to the string variant sub-selection mechanism.

Now, in the worst case, the translation service may translate the resource string for the most rudimentary substitution, providing only a single base variant. In the best case scenario, the translation service may translate the resource string for all possible logical values of gender and quantity. In between these two extreme cases, a translation service may translate the resource string separately for possible values for gender and for possible values of quantity, but not combinations of the possible values. In such a case, the order of the translated strings can determine whether a matching translation for a possible value of gender takes priority over a matching value for a possible value of quantity, or vice versa.
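
The last-match rule described above can be sketched as follows. This is only an illustration: the variant structure (a list of condition/text pairs) and the names VARIANTS and select_variant are assumptions made for the example, and the English texts stand in for real translations.

```python
# Hypothetical variant list for one resource string, ordered by the
# translator from most general to most specific.
VARIANTS = [
    ({}, "{name} bought {count} items"),
    ({"gender": "female"}, "{name} bought {count} items (f)"),
    ({"quantity": "one"}, "{name} bought one item"),
    ({"gender": "female", "quantity": "one"}, "{name} bought one item (f)"),
]


def select_variant(variants, cues):
    """Return the text of the last variant whose conditions all match the cues."""
    selected = None
    for conditions, text in variants:
        if all(cues.get(key) == value for key, value in conditions.items()):
            selected = text  # a later match overrides an earlier one
    return selected
```

If only the first three variants exist, a female name with a quantity of one resolves to the quantity variant, simply because the translator listed it after the gender variant.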

1.1.3.2. Pluralization

Language Plural Rules (CLDR)

1.1.4. Number Formatting

The localization system should support formatting a number correctly for the locale.

Many locales, for example, require a "." (period character) to indicate the decimal point, while many others (such as German and French locales) require a "," (comma character).

Properties of locale specific number formatting...

	Digit Group Separator - a "," (comma character) in many locales, but a "." (period character) in some locales and a " " (space character) in some others
	Digit Grouping - for most locales digits are grouped in threes, but for Hindi only the rightmost group (up to the hundreds) has three digits and all groups above it have two digits
	Fractional Part Separator - a "." (period character) in many locales, but a "," (comma character) in some locales
	Digit Characters - some languages use their own digit characters, and not all numeral systems are limited to ten digit characters (Japanese kanji numerals, for example, include a dedicated character for the number 10 as well as characters for larger powers of ten)
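
The digit grouping and separator properties listed above can be illustrated with a small sketch. The function name group_digits and its parameters are hypothetical; a real system would take these values from locale data rather than from arguments.

```python
def group_digits(integer_digits, separator=",", first_group=3, other_groups=3):
    """Group a string of digits from the right.

    first_group is the size of the rightmost group; other_groups is the
    size of every group to its left.
    """
    if len(integer_digits) <= first_group:
        return integer_digits
    head = integer_digits[:-first_group]
    tail = integer_digits[-first_group:]
    groups = []
    while head:
        groups.insert(0, head[-other_groups:])
        head = head[:-other_groups]
    return separator.join(groups + [tail])
```

With the defaults, "1234567" becomes "1,234,567". Passing other_groups=2 gives the Hindi-style grouping "12,34,567", and passing separator="." gives "1.234.567".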

1.1.5. Specific Numerical Type Formatting

Certain numerical types, such as currency values, percent values, measurement values, etc. have special formatting considerations that differ by locale.

	Measurements - Different types of measurements can be expressed in different units for different locales. For example, users in the USA may prefer to see fuel efficiency ratings for vehicles shown in MPG (Miles Per Gallon), while users in Europe would prefer L/100 km (Liters Per 100 kilometers). Formatting in this case would involve a value conversion as well as appending a different units suffix.
	Percentages - In most locales, the percent symbol for percentage values will be displayed after the number, while in some locales (Turkish, for example) it should be displayed before the number.
	Currency Values - Formatting of currency values can involve locale specific number formatting, currency symbol, and currency symbol placement.
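
As a sketch of the measurement case above, the following converts a fuel consumption value to MPG before appending the units suffix. It assumes the value is stored in liters per 100 km and that "MPG" means miles per US gallon; the function names and the "imperial" / "metric" labels are hypothetical, and locale-specific number formatting is left out.

```python
LITERS_PER_US_GALLON = 3.785411784
KILOMETERS_PER_MILE = 1.609344


def liters_per_100km_to_mpg(liters_per_100km):
    """Convert fuel consumption (L/100 km) to fuel economy (miles per US gallon)."""
    if liters_per_100km <= 0:
        raise ValueError("fuel consumption must be positive")
    km_per_liter = 100 / liters_per_100km
    return km_per_liter / KILOMETERS_PER_MILE * LITERS_PER_US_GALLON


def format_fuel_efficiency(liters_per_100km, unit_system):
    """Convert if needed, then append the units suffix for the unit system."""
    if unit_system == "imperial":
        return f"{liters_per_100km_to_mpg(liters_per_100km):.0f} MPG"
    return f"{liters_per_100km:.1f} L/100 km"
```

Note that the two representations are inversely related, so this is a true value conversion and not just a change of suffix.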

1.1.6. Currency Formatting

The localization system should support formatting of currency values appropriately for a locale.

Formatting a currency value correctly involves a combination of using the appropriate currency unit symbol and placement, along with the appropriate number formatting for the locale. It is not the responsibility of the localization system to perform currency conversion based upon prevailing exchange rates - that can remain the responsibility of the application.

In the case of substituting currency values into resource strings, it should not be the responsibility of the code providing the currency value substitution to perform pre-processing of the currency value in order to format it, but ideally the currency value provided to the substitution mechanism would be of a type that would cue the resource string processing code to invoke the appropriate formatting before value substitution. Either the value type can be provided to the resource string processing code, or the value type for a particular token could be explicitly provided in a descriptor for the token, as in...

amountPaid: The amount of {paid:currency} has been debited to your account.
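
The first option, where the value itself carries its type, might look something like the following sketch. The Money class, the FORMATTERS registry, and the plain {paid} token syntax are assumptions for illustration; the formatter shown is not locale-aware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Money:
    amount: float
    currency: str  # ISO 4217 code, e.g. "USD"


def format_money(value):
    # Placeholder formatting; a real formatter would consult the locale.
    return f"{value.currency} {value.amount:,.2f}"


# Hypothetical registry mapping value types to formatters.
FORMATTERS = {Money: format_money}


def substitute(template, values):
    """Replace each {token} with its value, formatted according to the value's type."""
    result = template
    for name, value in values.items():
        formatter = FORMATTERS.get(type(value), str)
        result = result.replace("{" + name + "}", formatter(value))
    return result
```

The calling code passes Money(1234.5, "USD") untouched, and the substitution mechanism decides how to format it.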

1.1.7. Date Formatting

The localization system should support formatting a date correctly for the locale, including the appropriate translations of day and month names.

As with currency formatting, when date type values are substituted into resource strings, it should not be required that the code providing the date value perform any pre-processing, but the substitution mechanism should be able to perform formatting of the date value as needed as part of producing the translated string.

joinDate: You activated your service on {activationDate:date}

Optionally, formatting cues may be provided - either as ancillary data that is provided along with and as part of the date value, or as an additional descriptor for the token in the resource string.

joinDate: You activated your service on {activationDate:date,short}
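
A token descriptor with an optional formatting cue could be processed along these lines. This is a sketch under stated assumptions: the regular expression, the render function, and the English-only "short" and "long" styles are all hypothetical, and "long" is assumed as the default when no cue is given.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical English-only date styles keyed by the option in the token.
DATE_STYLES = {
    "short": lambda d: f"{d.month}/{d.day}/{d.year % 100:02d}",
    "long": lambda d: f"{MONTHS[d.month - 1]} {d.day}, {d.year}",
}

# Matches {name:type} and {name:type,option}.
TOKEN = re.compile(r"\{(\w+):(\w+)(?:,(\w+))?\}")


def render(template, values):
    """Substitute typed tokens, applying date formatting where requested."""
    def replace(match):
        name, type_name, option = match.groups()
        if type_name == "date":
            return DATE_STYLES[option or "long"](values[name])
        return str(values[name])
    return TOKEN.sub(replace, template)
```

The caller supplies a plain date value; the choice between the short and long forms lives entirely in the resource string.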

1.1.8. Additional Types to Format

Additionally, the localization system should support the following type formatting...

	Time Formatting
	Phone Number Formatting
	Mailing Address Formatting

1.1.9. Measurement Units

The localization system should provide support for localizing measurement values for a locale, which may involve converting the measurement values from values expressed in a canonical system of units, such as the SI / metric system, to measurement values expressed in a system like the Imperial System.

1.1.9.1. Imperial vs Metric

The localization system should provide a defaulting behavior for selection between showing measurements in Imperial vs Metric systems for any given type of measurement, with the ability of the user to override the default for all types or on a type-by-type basis.

Unless overridden by the user or the application, the Imperial System would be used for select types of measurements (such as length, area, volume, weight, temperature, and speed) for select locales (primarily the US, with the UK and some others retaining Imperial units for certain measurement types), and the Metric System would be used for all other locales. The Metric System would also be used for types of measurements for which it is the universally adopted system (such as voltage and power consumption), even in locales that have a historical preference for the Imperial System.
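
The defaulting and override behavior can be sketched as a small resolver. The region set, the list of measurement types, and the function signature are assumptions for the example and are not meant to be a complete or authoritative table.

```python
# Assumed, deliberately minimal data for the sketch.
IMPERIAL_DEFAULT_REGIONS = {"US"}
IMPERIAL_CAPABLE_TYPES = {"length", "area", "volume", "weight", "temperature", "speed"}


def unit_system(measurement_type, region, user_default=None, user_overrides=None):
    """Resolve "imperial" or "metric" for one measurement type.

    Priority: universally metric types, then per-type user override,
    then the user's overall default, then the regional default.
    """
    overrides = user_overrides or {}
    if measurement_type not in IMPERIAL_CAPABLE_TYPES:
        return "metric"
    if measurement_type in overrides:
        return overrides[measurement_type]
    if user_default is not None:
        return user_default
    return "imperial" if region in IMPERIAL_DEFAULT_REGIONS else "metric"
```

A user in the US who prefers Celsius but is otherwise happy with the defaults would pass user_overrides={"temperature": "metric"}.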

1.1.9.2. Unit Symbol Localization

Beyond the system of units that is used for a particular type of measurement, the symbols that are used to denote the unit also need to be localized.

In many locales where the Metric System is used for measurements, the standardized and non-localized metric unit names may be acceptable, but in some locales this is not desirable, so the system needs to support the ability to provide translations for unit names for any unit type for any locale, with a suitable defaulting / fallback mechanism to reduce redundancy where unit names do not need special translation.

1.1.10. String Collation (aka Sorting)

The localization system should support string collation (sorting), taking into account how ordering rules apply to the character set in use by the language.

Sort orders such as alphabetical or ASCIIbetical are not meaningful for the writing systems for certain languages, such as Japanese kanji, and a different type of collation algorithm needs to be applied. The localization system can provide collation as part of a service API, considering that collation for certain languages can be complex and have a performance cost that is not desirable for a client device.

The localization system should provide the client with cues to indicate whether sorting can be performed on the client side with a simple algorithm or whether sorting should be offloaded to a service, and the client code should be written to anticipate that sorting might be a process involving an asynchronous service.
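
One way client code might anticipate both cases is to treat every sort as asynchronous. In this sketch the cue is a hypothetical set of languages assumed safe to sort on the client, and fake_remote_sort is a stand-in for a call to a collation service; neither reflects a real API.

```python
import asyncio

# Hypothetical cue: languages whose collation is assumed simple enough
# to handle on the client.
CLIENT_SORTABLE_LANGUAGES = {"en"}


async def sort_strings(strings, language, remote_sort):
    """Sort locally when the cue allows it, otherwise await a collation service."""
    if language in CLIENT_SORTABLE_LANGUAGES:
        return sorted(strings, key=str.casefold)
    return await remote_sort(strings, language)


async def fake_remote_sort(strings, language):
    # Stand-in for a network request; a real service would apply the
    # collation rules for the language.
    await asyncio.sleep(0)
    return sorted(strings)
```

Because callers always await sort_strings, moving a language from client-side to service-side sorting does not change the calling code.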

1.1.11. RTL Support

As far as RTL (right-to-left) layout is concerned, the localization system should support at least the ability to determine whether an RTL layout should be used for a specific locale.

Once it is known that RTL layout should be used for a locale, it is the responsibility of the UI framework to invoke RTL layout and the responsibility of the various UI components to implement support for RTL in their styling.
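
A minimal query for layout direction could be as simple as the following sketch. The language set is intentionally incomplete, the function name is hypothetical, and the check looks only at the language subtag, so locales whose direction depends on a script subtag are not handled.

```python
# A few languages that are written right-to-left; not an exhaustive list.
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def is_rtl(locale):
    """Return True if the locale's language subtag is in the RTL set."""
    language = locale.replace("_", "-").split("-")[0].lower()
    return language in RTL_LANGUAGES
```

The UI framework would call something like is_rtl("ar-EG") once per locale switch and set the layout direction accordingly.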

1.1.12. Style Conditionalization

The UI framework should support selection of styling code based upon the locale information queried from the localization system.

Style conditionalization can be performed in either (or both) of two ways…

	selection of a style variant for the locale that materializes conditionalization into the styling code through a compilation process (so, each locale gets a modified version of common, parameterized style code)
	loading of a base style for all locales combined with loading of a locale specific set of style overrides

1.1.13. Media Conditionalization

The localization system should facilitate the conditional selection of media appropriate for a given locale.

At the very least, the localization system should provide the necessary locale information to inform conditional selection of media, but the system may also provide a framework for defining locale specific overrides for media along with a system for selecting a best match for media for a given locale, with a fallback heuristic to address situations where very locale specific media variants are not provided.
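
A best-match selection with fallback might work roughly as follows. The "default" key, the hyphen-separated locale format, and both function names are assumptions made for this sketch.

```python
def locale_fallback_chain(locale):
    """Return candidates from most to least specific, ending with "default"."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["default"]


def select_media(variants, locale):
    """Pick the most specific media variant available for the locale."""
    for candidate in locale_fallback_chain(locale):
        if candidate in variants:
            return variants[candidate]
    raise KeyError(f"no media variant available for {locale}")
```

For example, a request for "fr-CA" against variants defined only for "fr" and "default" resolves to the "fr" variant.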

1.1.14. Locale Content Tailoring

The localization system should provide an application the locale information necessary to allow it to tailor its content to be most suitable for any given locale.

1.1.14.1. Content Filtering vs Content Ranking

While it is not the responsibility of the localization system to determine suitability of content for a locale, the basic locale information provided by the system can be used by a separate content system to inform filtering and ranking of content for any locale.

For instance, certain content may never be appropriate for a specific locale and it may be desirable to filter it out completely. In cases where content is permissible for a specific locale but is not best suited for that locale, the locale can be used to effect a negative weighting for such content for content searches and feeds in that locale. In such a content filtering and ranking system, locale may be just one of many dimensions that impacts filtering and ranking, along with dimensions such as user's age, sex, occupation, ethnicity, etc.

1.1.15. Feature Configuration

The localization system should provide an application the locale information necessary for it to determine which features of the application should be enabled and which should be disabled.

It is not the responsibility of the localization system to resolve feature configuration for a specific locale - this is the responsibility of a separate decision engine that will likely accept multiple input parameters to allow multi-dimensional configuration. That is to say, whether or not a particular feature is enabled may be determined by multiple different factors, one of which may be the user's locale (or data derived from the user's locale). In this sense, feature configuration is similar to locale content tailoring.

1.2. Required Characteristics of Localization System

1.2.1. Switchability on the Client

It should be possible to switch the locale to be used throughout an application on the client.

1.2.2. Concurrency on the Server

It should be possible to serve an application to users in multiple different locales from the same server.

To this end, it would not be a viable solution to build the set of localized assets for each of the supported locales while keeping the URIs the same. The URI for each localized asset would have to be different per locale, which also means that the client would have to request the appropriate localized version of an asset for the client's locale. Not encoding the locale into the path of the URI for a localized asset would leave only the option of serving the localized assets for different locales from different domains and, while there may in many cases be domains dedicated to certain of the major locales, this cannot be assumed to always be the case. Furthermore, it may be desirable to serve all the static assets from a central CDN with a common domain, even while the application may be served from locale-specific domains.
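
Encoding the locale into the path might look like the following sketch. The URL layout, the example domain, and the function name are illustrative assumptions only.

```python
def localized_asset_uri(base_url, locale, asset_path):
    """Build a per-locale URI by placing the locale in the path."""
    return f"{base_url.rstrip('/')}/{locale}/{asset_path.lstrip('/')}"
```

With a common CDN base such as "https://cdn.example.com/assets/", the same asset path yields a distinct URI for each locale, so one server (or CDN) can hold the assets for all locales side by side.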

1.3. Desired Characteristics of Localization System

1.3.1. Live Locale Switching

To increase the productivity of those doing translation of resource strings along with those doing visual testing of localized UI, it should ideally be possible to switch between locales without incurring a costly full reload of the page or app.

All UI components would ideally be implemented to allow updated resource strings to be dynamically propagated into the UI, and the app environment would support the ability to switch locale and have the necessary resource strings for the new locale be loaded for all the localized UI controls.

Live locale switching would ideally be supported at a UI component level, with the ability to use a locale switcher for an individual component. When a component is switched to a new locale, all of its child components and their child components (i.e. the entire component tree beneath it) should be switched to the new locale as well.
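
Component-level switching that cascades down the tree can be sketched as below. The Component class, its attributes, and the in-memory string tables are hypothetical; a real implementation would load the strings for the new locale and update the rendered UI.

```python
class Component:
    def __init__(self, name, strings_by_locale, locale="en"):
        self.name = name
        self.strings_by_locale = strings_by_locale
        self.children = []
        self.locale = locale
        self.rendered = {}
        self.refresh()

    def add_child(self, child):
        self.children.append(child)
        return child

    def refresh(self):
        # Re-resolve strings for the current locale without a full reload.
        self.rendered = dict(self.strings_by_locale[self.locale])

    def set_locale(self, locale):
        """Switch this component and its entire subtree to the new locale."""
        self.locale = locale
        self.refresh()
        for child in self.children:
            child.set_locale(locale)
```

Calling set_locale on a child affects only that child's subtree, which is what allows a locale switcher to be attached to an individual component.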

1.3.2. Dynamic / Live Updating

In order to enable more responsive tools for translating resource strings and instantly previewing the translations in the UI, the UI framework's localization mechanism should ideally support dynamically updating the UI as resource strings are modified.

Given such a live updating mechanism, translators can then visually preview their translations and correct layout issues immediately by making choices between different candidate translations to achieve the best layout for the language they are translating for, rather than having problems slip in and only be discovered during localization QA testing or - even worse - once the code has already been pushed to a production environment.

1.3.3. Media Template Processing

While it is desirable to compose resource strings into the presentation as part of a runtime rendering process, it is sometimes not possible to render all parts of the presentation dynamically.

In cases where it is necessary to localize media assets that cannot be rendered dynamically, because of performance issues or a lack of the rendering technology in the client, it is desirable to have a system in place to allow for rendering of templatized media of various types, either as part of a build process or as an on-demand process that is load optimized through a type of caching facility.

An alternative to a media template processing system is to rely solely on media conditionalization with the job of rendering templatized media being the responsibility of production processes of a creative department. This can work at a small scale, but becomes increasingly burdensome as product complexity and supported locales increase. A manual production process also does not lend itself well to an integrated translation process.

1.3.4. Locale String Inheritance

For any UI component system, whether implemented for Web technologies or native operating system frameworks, certain inheritance characteristics should be supported.

1.3.4.1. Inheriting Locale Strings From a Superclass

For a UI component class that extends / subclasses a UI component base class / superclass, the resource strings of the superclass should be inheritable.

In this way, a UI component subclass can extend its superclass and gain the benefit of all the localization that has already been done for it, without having to duplicate that effort. The same types of reuse patterns that one may wish to employ with other aspects of the code should also be applicable to localization.

1.3.4.2. Overriding Inherited Locale Strings

For resource strings inherited from a UI component's superclass, it should be possible to override any or all of them.

As with other inherited features, overriding a resource string in a UI component subclass should have no effect on the component's superclass and instances of both should be able to coexist in the page together without any conflict.

1.3.4.3. Declaring Additional Locale Strings

For any UI component subclass, it should be possible to augment the set of resource strings inherited from its superclass by declaring additional resource strings for the subclass.
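
The three inheritance characteristics (inheriting, overriding, and declaring additional strings) can be illustrated with ordinary class inheritance. The class names and the strings attribute are hypothetical, and the strings are shown for a single locale to keep the sketch short.

```python
class Button:
    strings = {"label": "OK", "tooltip": "Confirm"}

    @classmethod
    def resource_strings(cls):
        """Merge strings from the base class down to this class."""
        merged = {}
        for klass in reversed(cls.__mro__):
            merged.update(vars(klass).get("strings", {}))
        return merged


class DeleteButton(Button):
    # Overrides "label", inherits "tooltip", and declares "confirm".
    strings = {"label": "Delete", "confirm": "Are you sure?"}
```

Because the merge produces a new dictionary each time, overriding "label" in DeleteButton leaves the strings of Button unchanged, and instances of both classes can coexist.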

2. Special Requirements for Web Localization

2.1. Concurrency on the Client

The UI framework should ideally support multiple locales in use on the client concurrently.

This will support the ability to perform side-by-side comparisons of multiple instances of the same UI component set to different locales, in order to make judgements on translations as they impact layout, and layout and styling as they impact available translation options, thereby helping to reconcile conflicts between translation and layout/styling.

2.2. Encapsulation of Locale-specific Resources

The localization system should facilitate the encapsulation of locale-specific resources with the code modules that require those resources.

A UI framework should make use of the localization system's support for encapsulation of locale resources by encapsulating locale resources for UI components with the components themselves, thus allowing locale resources for UI components to be packaged along with the other code (JavaScript, CSS, HTML, etc.) for the components so that components can be delivered to a client across a network with fewer network requests.

In order to support encapsulation of locale-specific resources, the localization system should provide support for harvesting locale resources that are distributed throughout a codebase, potentially combining the resources together in a flat database for submission to a translation service, and then redistributing the translated resources to their correct original places throughout the codebase.

2.3. Locale Strings as Dependencies

A set of resource strings needed by a UI component should be expressed as a dependency of the UI component, in order to provide a standard mechanism by which to load them dynamically during development and by which to package them along with other JavaScript code for deployment to a production environment.

2.3.1. Parameterized Dependency

Because it is not possible for a UI component's code to know what locale the code will be run in, expressing a dependency on a set of resource strings can only go as far as providing a base guide, where the runtime environment will determine what exact module is required based upon the locale set for the environment at the time of loading the UI component.
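
A parameterized dependency could be expressed as a module name containing a placeholder that the runtime environment fills in at load time. The {locale} placeholder, the module naming scheme, and the function name below are all assumptions for illustration.

```python
# Hypothetical dependency list declared by a UI component.
CALENDAR_DEPENDENCIES = [
    "widgets.Calendar.styles",
    "widgets.Calendar.strings.{locale}",
]


def resolve_dependencies(dependencies, environment_locale):
    """Fill in the locale placeholder using the environment's current locale."""
    locale_part = environment_locale.replace("-", "_")
    return [name.replace("{locale}", locale_part) for name in dependencies]
```

The component code never names a concrete locale; only the environment that loads it does.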

2.4. Determining String Origins

Determining the origins for externalized resources is a nice-to-have that can help developers get to the source of resource string related issues that surface in UI...

	requires some form of tagging identifier for every resource
	for a system that uses a RESTful approach to resource strings, each resource string could have a URI, which could be used either to identify a place in a DB or CMS where the string can be changed, or it could be mapped to a resource file in the source codebase
	for resource strings bound to HTML data, the string could be wrapped in a span with a title attribute, or an HTML comment could be inserted that could be found in a DOM inspector
	for resource strings that make their way into element titles or select tag option values, a different approach would be needed

3. Serialization of Types

Various value types that can be displayed to the user, either alone or as substitutions in resource strings, may have locale specific serializations.

Locale-specific formatting of a number value, for instance, can be viewed as a form of locale-specific serialization to string of number type values.

3.1. Currency Values

Monetary amounts may have locale-specific formatting and an associated currency, but currency, number formatting, and locale can also all be independent of one another.

3.1.1. No Canonical Currency Units

Canonical values don't exist for currency values in the same way as they do with linear dimension values, for example.

Conversion between different currencies must involve a central monetary exchange service that is aware of the up-to-date intercurrency exchange rates - it is not possible to implement static code as part of an internationalization utilities library that can perform conversion for the purpose of display.

Therefore, a currency value type should always carry the currency for the value independent of the locale setting for the application.

3.1.2. No One-to-one Locale-to-currency Mapping

There is not a one-to-one mapping between locale and currency.

Certain locales may have an associated currency (or currencies), but it can't be assumed that there is a specific currency for a specific locale. Panama, for example, supports both the US Dollar and the Panamanian Balboa.

In cases where the appropriate currency can't be determined from either the locale in which the application is offered or the locale of the user, the application may wish to make the currency selectable by the user, and the application may also wish to display monetary amounts in two or more currencies. A traveler in the United Kingdom, for example, may have a combination of Pound Sterling and Euros in their wallet and it may be convenient to see amounts in both currencies, with the secondary / non-dominant currency shown in a separate column or in parentheses.

3.1.3. Number Formatting Versus Currency

Number formatting and currency formatting can be distinct from one another and should not be conflated.

For example, in an application that displays amounts denominated in Euros in Germany, there would be an expectation for the correct Euro currency symbol to be used as well as the appropriate use of the period character to delimit thousands and the comma character to indicate the decimal position.

AMOUNTS DISPLAYED IN GERMANY

€ 1.234,56  (US $1.973,04)

In contrast, if a Euro-denominated amount were to be displayed as a comparison alongside a US dollar amount to users in the US, there would be an expectation for the thousands in the Euro amount to be delimited using commas and the decimal place indicated by a period.

AMOUNTS DISPLAYED IN US

€ 1,234.56  (US $1,973.04)
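
The separation between the two concerns can be sketched by looking up the number format by locale and the currency symbol by currency, independently of one another. The two tables and the function name are hypothetical, and the symbol prefixes (including their spacing) are copied from the examples above.

```python
# Number formatting depends only on the display locale.
NUMBER_FORMATS = {
    "de-DE": {"group": ".", "decimal": ","},
    "en-US": {"group": ",", "decimal": "."},
}

# The symbol prefix depends only on the currency of the amount.
CURRENCY_PREFIXES = {"EUR": "€ ", "USD": "US $"}


def format_amount(amount, currency, locale):
    """Format an amount using the locale's separators and the currency's symbol."""
    fmt = NUMBER_FORMATS[locale]
    whole, fraction = f"{amount:,.2f}".split(".")
    number = whole.replace(",", fmt["group"]) + fmt["decimal"] + fraction
    return CURRENCY_PREFIXES[currency] + number
```

The same Euro amount therefore renders as "€ 1.234,56" for de-DE and "€ 1,234.56" for en-US, without any change to the currency.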

4. Types in the Context of Localization

4.1. Locale Specific Serialization

Every type should be able to have a locale specific serialization.

Locale specific serializations may support multiple serialization options. For example, a date type can support narrow, short, or long options for displaying month. A number type can support displaying digit grouping or not.

4.1.1. Inheriting Serialization Options

A type that inherits from a more basic type, such as a currency type that inherits from the basic number type, should inherit the display options from that basic type.

So, a currency type would allow display of grouping to be specified as a serialization option, but would also extend an option for display of the currency type (either the currency symbol or the currency code).

4.2. Types When Used in Resource Strings

4.2.1. Specifying Types

When a resource string expects substitution values, the type for every value expected by the string should be specified.

4.2.1.1. Specifying Types Should be Required

It should be a requirement that the type be specified for every substitution value expected by a resource string.

This is to ensure that translators always have sufficient information that can be obtained purely from the resource string files to inform their translation process, without having to rely on the resource files being used in the running application with type information supplied only at runtime.

With type information contained inside the resource string files, tools with awareness of types can supply type example values to aid in the translation process and allow string translations to be previewed before being re-integrated into the codebase and then deployed to a staging area. Policing and auditing tools can be employed to enforce this requirement when sanity checking the resource string files.

4.2.1.2. Casting Types

For any substitution token of a resource string, it should be possible to cast a supplied value of a given type to a different, compatible type.

4.2.1.2.1. Casting from Raw

In the most basic case, it should be possible to cast raw number type values to special types such as currencies, measurements, percents, etc. for the purpose of applying locale-specific formatting.

4.2.1.2.2. Casting for Different Formatting Options

In some cases, it might be desirable to cast a value that has a supplied type to another type that defines additional formatting options that are desired for the localization.

An example of this might be receiving a date type value for a substitution token and casting the value to a time ago type for displaying how long ago a date is from the present time. In the same resource string, it might be desirable to use a substitution token to display a date value, while using another substitution token in parentheses to display the time ago equivalent of the same date.

EXAMPLE

Your enrollment was approved on {approvalDate:Date} ({approvalDate:TimeAgo})
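
Processing of this example might be sketched as follows, with the same date value formatted once as a Date and once cast to TimeAgo. The formatter functions, the ISO date output, and the explicit today parameter are assumptions for the sketch, which also assumes the date is not in the future.

```python
import re
from datetime import date


def format_date(value, today):
    # Placeholder formatting; a real formatter would be locale-aware.
    return value.isoformat()


def format_time_ago(value, today):
    days = (today - value).days
    if days == 0:
        return "today"
    if days == 1:
        return "1 day ago"
    return f"{days} days ago"


# Hypothetical registry of types that a date value may be formatted as or cast to.
CASTS = {"Date": format_date, "TimeAgo": format_time_ago}
TOKEN = re.compile(r"\{(\w+):(\w+)\}")


def render(template, values, today):
    """Format each occurrence of a token according to its own declared type."""
    def replace(match):
        name, type_name = match.groups()
        return CASTS[type_name](values[name], today)
    return TOKEN.sub(replace, template)
```

The caller supplies approvalDate once; each occurrence of the token decides independently how that value is displayed.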

4.2.1.2.3. When Casting is Possible

A value of a certain type should be castable to a different type if...

4.2.1.2.3.1. Meaning is Not Broken

Meaning should not be fundamentally broken in the process of casting a value from one type to another.

For example, casting a distance type value to a currency type value would fundamentally break the meaning. On the other hand, casting from a date type value to a time type value would not fundamentally break the meaning, since time of day is contained within / implied by a date value.

4.2.1.2.3.2. The Cast Value Remains Correct

A cast value can have less precision than the original value, but the value may not be fundamentally incorrect.

If, for example, a color type value was cast to a shade type value that had possible values of dark, medium-dark, medium, medium-light, and light, precision would be lost but the value would not be incorrect. If, on the other hand, a currency in US dollars value was cast to a currency in Japanese Yen value, without applying the appropriate currency exchange rate, the cast value would most likely be incorrect.

4.2.1.2.4. Extending Types to Add Formatting Options

Types can be defined to extend other types simply to offer additional formatting options or type example values, but in no way fundamentally change the meaning of the type.

This pattern of extending is useful for managing code so that the type system's code can be layered. The alternative would be for there to be only one type definition for any given type of value, and that single definition would have to grow in size to include all possible ways of formatting values of that type.

In situations where type extending is used to provide additional formatting options or example values, a resource string may cast a value of Type A to Type B, where Type B extends Type A by adding the formatting options that are desired by the resource string.

4.2.1.2.5. Types Can Have Resource Strings

The definitions for certain types, themselves, may have dependencies on resource strings.

Therefore, the type system should support localization in the same way as projects with resource strings that will leverage the type system. Take the example of a date type. In order to serialize date values for a specific locale, the type will require translated names for days and months for that locale for the formatting options that show day and/or month. Moreover, formatting of numbers may involve selection of characters from a digits set resource string to support certain locales that don't use the Western 0-9 characters.

4.2.2. Serialization Options Inherited from Master

When a token is defined for a resource string in the master resource strings file, any serialization options defined for the token should be inherited by the locale specific translations of the resource string, but should also be overridable for specific locales.

For example, for a specific string it might be decided that it is desirable for a date token to use the long option for month when formatting the substituted date values. Then, for a specific language (such as German), it might become evident during testing in German that the layout does not have sufficient room to accommodate long month names, so the narrow option could be specified explicitly for the token in the German translation.
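
The inherit-then-override behavior for token options could be resolved along these lines. The nested dictionaries, the option names, and the function name are assumptions chosen to mirror the German example above.

```python
# Options declared in the (hypothetical) master resource strings file.
MASTER_TOKEN_OPTIONS = {
    "joinDate": {"activationDate": {"month": "long", "year": "numeric"}},
}

# Per-locale overrides declared alongside the translations.
LOCALE_TOKEN_OPTIONS = {
    "de": {"joinDate": {"activationDate": {"month": "narrow"}}},
}


def token_options(string_key, token, locale):
    """Start from the master options and apply any locale-specific overrides."""
    options = dict(MASTER_TOKEN_OPTIONS[string_key][token])
    overrides = LOCALE_TOKEN_OPTIONS.get(locale, {}).get(string_key, {}).get(token, {})
    options.update(overrides)
    return options
```

Only the overridden option changes for German; every other option, and every other locale, continues to follow the master.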

4.2.3. Occurrence-specific Formatting

A substitution token may be used more than once in the same string, and it should not be assumed that every occurrence has the same formatting options.

As such, different formatting options may be applied for different occurrences of the same token in a string, or the value may even be cast to a different type in a specific occurrence.

4.3. Extensible System for Types for Resource Strings

The resource string system should support an extensible system for types, so that support for new types can be added over time.

A dependency management system should be employed so that the localization client code does not need to grow ever larger and larger as support for more type formatting of substitution values is added. Instead, when a resource strings file is compiled to a module, the type formatting used in resource strings throughout the file can be used to determine the specific type module and locale specific formatter module dependencies for just the resulting compiled resource strings module.

Such an approach would be much like the encodings system in UIZE's JST templates. The only thing that grows in the base is the registry of encodings to modules and methods to support those encodings, but the actual type formatting code does not become a dependency for all templates - only those encoding modules that are in use by the JST template being compiled.

If either a raw value (like a raw number) is passed as a substitution value and is cast to a type through a formatting option in the resource string, or if a value of a certain type is passed as a substitution value and is cast to a different type in the resource string, then the resource strings module has a dependency on the type module for the type used in formatting the resource string.
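The dependency determination described above might look something like the following sketch. The registry contents, the module names, and the {tokenName:typeName} token syntax are hypothetical and serve only to illustrate the idea.

```javascript
// Hypothetical registry mapping type names to the modules that implement them.
// Only this registry grows in the base; the type modules themselves are loaded on demand.
const typeModuleRegistry = {
  date: 'Types.Date',
  number: 'Types.Number',
  personName: 'Types.PersonName'
};

// Scan all resource strings in a file and collect the type modules (and the
// locale specific formatter modules) that the compiled module would depend on.
function findTypeDependencies(resourceStrings, locale) {
  const modules = new Set();
  for (const text of Object.values(resourceStrings)) {
    for (const [, typeName] of text.matchAll(/\{\w+:(\w+)\}/g)) {
      const typeModule = typeModuleRegistry[typeName];
      if (!typeModule) throw new Error('Unregistered type: ' + typeName);
      modules.add(typeModule);
      modules.add(typeModule + '.' + locale); // locale specific formatter module
    }
  }
  return [...modules].sort();
}
```

A resource strings file that uses only date and person name formatting would then depend on just those two type modules, and not on the number type module.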

4.4. Type Example Values

Every type should be able to define associated test values that exhibit extremes in behavior or side effects for the type, or just offer convenience when testing the localization to a specific locale.

Types can have different test values associated with (but not exclusive to) different locales. For example, when testing locale strings in French, a PersonName type could have test values that are distinctively French names, but there's no saying that French names couldn't appear for other locales, especially considering how globalized and multicultural the world is.

So, locale specific example values for the PersonName type would be useful mostly as hints when testing for specific locales, especially to the extent that an application localized to a specific locale may have a predominance of users with names associated with that locale.
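One possible way to organize such example values is sketched below. The data structure, the example names, and the exampleValuesFor function are hypothetical and chosen purely for illustration.

```javascript
// Hypothetical example values for a PersonName type.
const personNameExamples = {
  // Values that exercise extremes (very short, very long, punctuation, accents).
  generic: ['Al Wu', 'Bartholomew Featherstonehaugh-Cholmondeley', "Siobhán O'Connor"],
  // Values associated with (but not exclusive to) specific locales.
  byLocale: {
    'fr-FR': ['Françoise Lefèvre', 'Jean-Baptiste Delacroix']
  }
};

// Locale specific values come first as hints; the generic values always follow,
// since names from any culture can appear in any locale.
function exampleValuesFor(examples, locale) {
  return [...(examples.byLocale[locale] || []), ...examples.generic];
}
```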

5. Useful Background Information

5.1. ICU

ICU (International Components for Unicode) is a project first developed by Taligent and then later absorbed into IBM after they acquired Taligent.

ICU provides code to support correct localization and is available in ports for C/C++ and Java. ICU covers character encoding, formatting of various data types, message formatting, time calculation and timezones, etc. ICU is in quite broad use and forms the basis for some of Google's localization support. Google relies on ICU Message Format in its ARB (Application Resource Bundle) format, for example.

ICU Homepage

5.2. XLIFF

XLIFF (XML Localization Interchange File Format) is an XML-based file format for the transfer of localizable data during the localization process.

XLIFF was standardized by OASIS in 2002. It's not easy to ascertain the level of adoption of and tools support for XLIFF in the industry.

5.3. ARB

ARB (Application Resource Bundle) is a file format advanced by Google to package and organize localization resources.

The ARB format makes use of ICU Message Format for resource strings.

5.4. CLDR

CLDR (Unicode Common Locale Data Repository) is a project of the Unicode Consortium that is intended as a central repository for locale information.

Data from the CLDR project is incorporated in various ways into operating systems, Web browsers, and software development frameworks to allow applications to be localized. The data is available in LDML format (XML-based) as well as JSON format.

5.5. JavaScript Intl Object

The Intl object is defined by the ECMAScript Internationalization API (ECMA-402), a companion standard to ECMAScript, and provides JavaScript with facilities for locale specific formatting.

The Intl object serves as a namespace and provides objects for string collation, date time formatting, and number formatting. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl
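A brief illustration of the three facilities mentioned above, using the standard Intl.NumberFormat, Intl.DateTimeFormat, and Intl.Collator constructors. The sample values are arbitrary, and the exact output for some locales depends on the locale data available in the runtime.

```javascript
const amount = 1234567.891;
const when = new Date(Date.UTC(2014, 2, 5, 12, 0, 0));

// Number formatting with locale specific grouping and decimal separators.
const usNumber = new Intl.NumberFormat('en-US').format(amount);

// Date formatting with a long month name; the time zone is fixed to UTC
// so that the result does not depend on the machine's local time zone.
const deDate = new Intl.DateTimeFormat('de-DE', {
  year: 'numeric', month: 'long', day: 'numeric', timeZone: 'UTC'
}).format(when);

// Collation: sort strings according to a locale's ordering rules.
const sorted = ['z', 'ä', 'a'].sort(new Intl.Collator('de').compare);
```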

5.6. Miscellaneous

Innovations in i18n at Google

5.7. ISO Standards

Some relevant ISO standards...

	ISO 639 - language codes
	ISO 3166 - codes for countries, dependent territories, areas of geographical interest
	ISO 4217 - currency codes
	ISO 8601 - dates / times

6. Localization Guidelines

6.1. The Canonical String Principle

The canonical string principle states that for any resource string there should be an authoritative, canonical form.

Furthermore, as a matter of process / workflow, it should be possible to create translations of the string for all supported languages and replace any existing translations with the new translations without any ill effect. Adhering to this principle implies that the ongoing translation process should NOT rely upon the history of previous translations for strings.

Specifically, developers should not introduce variances in translations of a string for specific languages with the expectation that a translator will be able to correctly carry over / map those variances to new translations as the canonical string is changed over time.

Consider an example where a developer decides to style a specific part of a translated string only for a specific language, possibly to address some issue of layout in that language, and possibly by adding markup, markdown, or wikitext formatting to some part of the string. If the UI is later changed and the canonical string is modified, it might be impossible for the translator to understand how to map the language specific styling to the new translation.

To avoid these kinds of problems, the canonical string should be regarded as the only authority, and all translations of a string should be regarded as derived / generated. A better solution in such situations would be for the developer or translators to add metadata for the canonical string that notes any language-specific considerations for translating the string.

7. Pseudo-localization

In a nutshell, pseudo-localization is a mode of running an application that replaces externalized resource strings with pseudo-localized strings that are derived from the source strings.

EXAMPLE - SOURCE

E-mail address: [email protected]
Mobile phone: 555.123.4567

EXAMPLE - PSEUDO-LOCALIZED

[E-mail.. address...]: [email protected]
[Mobile... phone...]: 555.123.4567

7.1. Issues that pseudo-localization can help expose...

	resources that should be externalized, but aren't
	resources that are externalized, but shouldn't be
	layout that breaks as a result of expansion during translation
	programmatically concatenated resource strings
	software that doesn't correctly handle non-ASCII characters
	layout that can't accommodate vertical expansion of characters with diacritical / accent marks

7.2. Techniques that can be employed...

	wrapping resources in boundary delimiters (like brackets)
	using non-ASCII, accented versions of alphabetical characters
	adding additional characters to simulate expansion during translation
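The three techniques listed above can be combined in a single transformation, as in the following minimal sketch. The accent mapping, the default expansion ratio, the padding character, and the pseudoLocalize function name are all assumptions made for illustration. A real implementation would also need to leave substitution tokens and non-translatable strings untouched.

```javascript
// Hypothetical (partial) mapping of ASCII vowels to accented look-alikes.
const accentMap = {
  a: 'å', e: 'é', i: 'î', o: 'ö', u: 'ü',
  A: 'Å', E: 'É', I: 'Î', O: 'Ö', U: 'Ü'
};

function pseudoLocalize(source, expansionRatio = 0.3) {
  // Use non-ASCII, accented versions of alphabetical characters.
  const accented = source.replace(/[aeiouAEIOU]/g, ch => accentMap[ch]);
  // Add extra characters to simulate expansion during translation.
  const padding = '_'.repeat(Math.ceil(source.length * expansionRatio));
  // Wrap the result in boundary delimiters.
  return '[' + accented + padding + ']';
}

const pseudo = pseudoLocalize('Mobile phone');
// pseudo: "[Möbîlé phöné____]"
```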

7.3. Two Process Approaches

	support pseudo-locales in all layers of tech stack, so that code can be tested with pseudo-locales without any dependency on supporting localization processes
	build pseudo-localization support into the localization process, so that all layers of the tech stack can get the benefit of pseudo-localization without having to implement their own programmatic support for it (which could also be inconsistent)

Naturally, both approaches could be used in conjunction.

7.4. Pseudo-locale

To allow pseudo-localized resource strings to co-exist with resource strings translated for various real locales, it is recommended that the pseudo-localized strings be associated with a pseudo-locale that can be selected in the same way as real locales.

To facilitate this, it is recommended that the pseudo-locale be identified by a special pseudo-locale code that adheres to the same rules as the locale codes for real locales. It is, therefore, recommended that a BCP 47 compliant code be chosen to represent the pseudo-locale.

7.4.1. Pseudo-locale Code

The recommended BCP 47 format locale code to use for pseudo-localized resource strings is "en-ZZ".

BCP 47 does not address pseudo-localization and, as such, does not provide any specific guidance on how to deal with locale codes for pseudo-localized resource strings.

It does, however, leverage the ISO 3166-1 standard for region codes, and this standard allows for a limited set of user assigned region codes that are reserved for non-standard use. These ranges of codes can be used for any internal / proprietary use.

http://en.wikipedia.org/wiki/ISO_3166-1#Reserved_and_user-assigned_code_elements

Based on this provision, it is recommended that the locale code that is used to represent the pseudo-locale be a BCP 47 format locale code whose region part is a user-assigned region code, so that the pseudo-locale code is fully compliant and doesn't risk any collision - now or in future - with language codes and region codes for real languages or regions. Although three letter region codes are acceptable, it is recommended that a code that lies within the two letter range for user-assigned region codes be used. This is for the sake of simplicity and consistency with the prevalent use of two letter region codes for other locales.

Based on the ISO 3166-1 standard, there are a few candidates for pseudo-locale region codes that should appear to developers to be obviously intended for a special purpose (pseudo-localization, in this case)...

	en-AA
	en-QQ
	en-XX
	en-ZZ – top choice

Out of this short list, "en-ZZ" is the most appealing, since it stands out as probably having a special meaning (in case you see it for the first time), and it sorts to the end of the list of regional variants of English.
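A simple check for whether a locale code uses a user-assigned region code could be sketched as follows. The two letter user-assigned code elements in ISO 3166-1 are AA, QM through QZ, XA through XZ, and ZZ. The function names are hypothetical, and the parsing is deliberately simplified to handle only codes of the form language-region.

```javascript
// ISO 3166-1 alpha-2 user-assigned code elements: AA, QM-QZ, XA-XZ, ZZ.
function isUserAssignedRegion(region) {
  return /^(AA|Q[M-Z]|X[A-Z]|ZZ)$/.test(region);
}

// Simplified: accepts only "language-REGION" codes, not full BCP 47 syntax.
function isPseudoLocaleCode(localeCode) {
  const [language, region] = localeCode.split('-');
  return /^[a-z]{2,3}$/.test(language) &&
    region !== undefined &&
    isUserAssignedRegion(region);
}
```

All four candidates listed above pass this check, whereas a code for a real region, such as "en-US", does not.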

7.5. References

	http://en.wikipedia.org/wiki/Pseudolocalization
	http://google-opensource.blogspot.com/2011/06/pseudolocalization-to-catch-i18n-errors.html

8. Resource Files

The localization system should support the organization of resource strings into resource files, such that the resource strings are externalized from the application code that uses them.

8.1. Resource File Format

The format that is used for storing resource strings should satisfy a number of key requirements.

8.1.1. Pure Data Format

It is recommended that the resource file format should be a pure data format, completely devoid of the capabilities of the programming language used by the application that will use the resource strings.

8.1.1.1. Prevents Undesired Dependencies

This has the benefit of encouraging clear separation between the application logic and the resource strings.

Without such forced restrictions on what developers can do in the resource files, there will always be the temptation to access features of the environment into which the resource files will be loaded. If the resource files are implemented using the application framework's language (such as JavaScript, PHP, ActionScript, etc.), then a developer may be tempted to access a global environment variable or to call a function they expect to be defined in the runtime context, thus creating unwanted dependency relationships between the resource files and the state of the runtime environment into which they will be loaded.

8.1.1.2. Avoids Parsing Difficulties

Resource files that are implemented using the application framework's programming language pose challenges to parsing the files for the sake of automation of the localization process.

A pure data file format has constrained variability and strict rules for structure and serialization. In contrast, the syntax of a programming language supports an immense degree of variability, by design. To parse resource strings from a resource file that is implemented using a programming language, and where developers may get up to all sorts of tricks to satisfy pressing development requirements, can present ongoing problems.

A developer may, for example, choose to construct a resource string's value using a programmatic expression. Now, unless your resource file parser evaluates the resource file using the language in which it is written, it will not be able to obtain a resolved value for such an expression in order to send the resource string off for translation.
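The difficulty can be illustrated with a small contrast, sketched here under the assumption that JSON is the pure data format. APP_NAME is a hypothetical runtime variable, and extractStrings is a hypothetical helper of an automation tool.

```javascript
// Resource file written as program code: the value depends on a runtime global
// (APP_NAME), so it cannot be resolved without evaluating the code.
const programmaticSource = "({ welcome: 'Welcome to ' + APP_NAME })";

// The same string in a pure data format (JSON), using a substitution token instead.
const pureDataSource = '{ "welcome": "Welcome to {appName}" }';

// An automation tool that only parses data can read the second file but not the first.
function extractStrings(fileContents) {
  try {
    return JSON.parse(fileContents);
  } catch (error) {
    return null; // not pure data: strings cannot be extracted without evaluation
  }
}
```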

8.1.1.3. Candidate Formats

To satisfy a requirement to use a pure data format, any of the following standard file formats could be used...

	JSON (JavaScript Object Notation)
	XLIFF (XML Localization Interchange File Format)
	YAML (YAML Ain't Markup Language)
	Java .properties
	Mac Strings
	ARB (Application Resource Bundle) - essentially a JSON-based format

8.1.2. Additional Recommendations

In addition to the recommendation that a pure data format be used for resource files, the following additional recommendations are made...

8.1.2.1. A Clean Format

It is recommended that the format used for resource files be as clean and uncluttered as possible.

That is to say, there should be as little need as possible to quote and escape characters in resource strings. Some formats, such as XLIFF (XML-based) or JSON, require strings to be enclosed somehow. In the case of an XML-based format, the strings will either have to be enclosed in quotes as attribute values or enclosed in container XML tags. In either case, escaping to XML entities will be required for characters in the resource strings that collide with the format's delimiter characters.

The same principle applies with a JSON-based format, where resource strings would be expressed as string literals enclosed in quotes, and escaping will be required for quote and linebreak characters inside the strings.

8.1.2.2. Narrowed List of Candidate Formats

To satisfy the additional requirements listed above, any of the following file formats could be used...

	YAML (YAML Ain't Markup Language)
	Java .properties
	Mac Strings

8.1.3. Resource String Metadata

The resource file format should allow optional metadata to be supplied for every resource string.

A generic system for specifying resource string metadata can serve a number of different purposes and may be used to capture the following types of information...

	resource string translatability
	resource string type
	resource string context
	resource string constraints
	resource string translation instructions

8.1.3.1. Resource String Translatability

Resource files can be used to contain resource strings that are not to be translated, such as URLs, e-mail addresses, phone numbers, color values, etc.

In cases where a resource string is not translatable, it is helpful to the translation process to be able to know this with certainty. The metadata facility can be leveraged to flag certain strings as being non-translatable strings by defining a metadata property for specifying translatability.

8.1.3.2. Resource String Type

Beyond merely indicating resource string translatability, the metadata facility can be used for supplying additional type information for resource strings that are serializations of complex object types.

With a mechanism for specifying resource string types, resource strings can be serializations of complex object types, such as color values. Type information that is stored as part of the metadata for strings can then be used by the resource strings runtime loader to instantiate the appropriate parser / de-serializer classes to turn the string serializations of the objects back into the objects.
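As a minimal sketch of this idea, a runtime loader could look up a de-serializer by the type named in a string's metadata. The resource entry structure, the meta property names, and the deserializers registry are hypothetical.

```javascript
// Hypothetical resource entries, each with a value and optional metadata.
const resources = {
  brandColor: { value: '#ff8800', meta: { type: 'color', translatable: false } },
  greeting: { value: 'Hello' }
};

// Hypothetical registry of de-serializers, keyed by resource string type.
const deserializers = {
  color: text => {
    const m = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(text);
    if (!m) throw new Error('Invalid color: ' + text);
    return {
      red: parseInt(m[1], 16),
      green: parseInt(m[2], 16),
      blue: parseInt(m[3], 16)
    };
  }
};

// Turn a typed resource string back into an object; untyped strings pass through.
function loadResource(entry) {
  const type = entry.meta && entry.meta.type;
  return type && deserializers[type] ? deserializers[type](entry.value) : entry.value;
}
```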

8.1.3.3. Resource String Context

The metadata facility can be used to provide information on the context in which resource strings are used, which can then be used to assist the translators in choosing the best suited translation for the context.

Context information might describe that a resource string is to be used in a specific type of UI component, such as a calendar widget, or in a specific screen in a mobile application, or in a specific page or section in a Web application. Context information may also provide a URL for previewing the context in which the resource string would be used.

8.1.3.4. Resource String Constraints

The metadata facility can be used to provide information on any constraints that should be considered when translating a resource string to another language.

A common type of constraint that can be noted in metadata is a limit on the character length for a translation. This can be expressed in a number of possible ways...

	the translation is not to exceed a specified maximum number of characters in length
	the translation is not to exceed the length of the English source string (in case the UI layout can fit only as many characters as are in the existing English source string)
	the translation should be no more than a specified percentage longer than the English source string (e.g. maximum 20% longer than original)
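The three forms of length constraint listed above could be checked by automation along the following lines. The constraint property names (maxChars and maxExpansionPercent) and the function names are hypothetical. A maxExpansionPercent of 0 expresses the case where the translation may not exceed the length of the source string.

```javascript
// Compute the strictest limit implied by a string's constraints metadata.
function maxAllowedLength(source, constraints) {
  const sourceLength = [...source].length; // count code points, not UTF-16 units
  const limits = [];
  if (constraints.maxChars !== undefined) limits.push(constraints.maxChars);
  if (constraints.maxExpansionPercent !== undefined) {
    limits.push(Math.floor(sourceLength * (1 + constraints.maxExpansionPercent / 100)));
  }
  return limits.length ? Math.min(...limits) : Infinity;
}

function fitsConstraints(source, translation, constraints) {
  return [...translation].length <= maxAllowedLength(source, constraints);
}
```

For example, with a source string of "Settings" (8 characters) and a maximum expansion of 20%, a translation of up to 9 characters would be accepted.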

8.1.3.5. Resource String Translation Instructions

The metadata facility can be used to provide general instructions to inform how a resource string should be translated.

As an example, it might be desirable in the UI layout for an English source string that is a single word to be translated as a single word. This hint can be included in translation instructions inside the metadata for the resource string. With this information, a translator can then choose a single word translation in a situation where there is an acceptable single word translation for the resource string, even if a multi-word translation would be slightly more natural for the user.

8.1.4. Non-translatable Strings

Non-translatable strings are resource strings that should not be translated.

Examples of non-translatable strings include...

	URLs
	e-mail addresses
	phone numbers
	color values
	CSS style properties
	dimensions
	etc.

Even though these strings may not be translatable, it is nevertheless practically useful to store them in resource files along with translatable strings.

If automation scripts that prepare jobs for translation can detect resource strings that are non-translatable, then these strings can be excluded from the translation jobs that are sent to translators. In this way, one can eliminate the cost associated with the translators evaluating these strings for translation, since there may be a baseline processing fee even if a determination is made to leave the strings untranslated.

An additional and even more compelling benefit to knowing which resource strings are not to be translated is that these strings can be left as is by an automated pseudo-localization process, since inappropriately pseudo-localizing some non-translatable strings such as URLs may actually break parts of the functionality in a pseudo-localized version of an application, thereby hampering the efforts of QA testers.
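A sketch of how an automation script might exclude non-translatable strings when preparing a translation job is shown below. The resource file structure and the translatable metadata property are hypothetical. The same isTranslatable check could be used by a pseudo-localization process to decide which strings to leave as is.

```javascript
// Hypothetical resource file contents, with optional metadata per string.
const resourceFile = {
  signInTitle: { value: 'Sign in' },
  supportUrl: { value: 'http://www.example.com/support', meta: { translatable: false } },
  accentColor: { value: '#ff8800', meta: { translatable: false } }
};

// Strings are assumed to be translatable unless explicitly flagged otherwise.
function isTranslatable(entry) {
  return !(entry.meta && entry.meta.translatable === false);
}

// Collect only the strings that should be sent to translators.
function buildTranslationJob(resources) {
  const job = {};
  for (const [key, entry] of Object.entries(resources)) {
    if (isTranslatable(entry)) job[key] = entry.value;
  }
  return job;
}
```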

UIZE JavaScript Framework