Brief. Heh. This document is 3,200 words long. Happy reading!
---------
For a long time, we at Infopop have had a very annoying and persistent problem. New customers keep moving over to us from other products, but we haven't had the resources to develop custom import engines.
The main problem with writing importers is that everyone does something differently than everyone else. You can't just write one block of code and expect it to work for more than one product... or in some cases, even more than one version of any one product. Trying to keep multiple importers from multiple other products up to date with three of our own products would almost be a full time job.
A few months ago, the Infopop developers sat down and pondered this major problem. After looking at the database structures of a handful of other products, we determined that we could write a "generic" import engine - it would be given the data to be imported, along with a file that declares how the data is arranged. The main problem would be that because it would need to be so generic, it could only bring over categories, forums, posts, and users. Almost everything else, including settings, styles, permissions, private messages, etc., is done so differently that it can't be done by a generic engine.
UBB.classic 6.7 is the first beta product to have the new generic importer. Eve/UBB.x will be the next, followed by UBB.threads. I'll be writing the UBB.threads importer engine. (Note, May 3, 2004 - this was true when I wrote this document, but is no longer. Please take this fact into account when reading the rest of the doc. Rick did a good job with the 6.5 importer...)
By the end of the beta cycle, we hope to have format declarations for Invision, vBulletin, and phpBB. We already have experimental formats for all of our own products.
So, what's a format declaration? We settled on using YAML to store format data for the importer. The YAML spec will contain a list of the dumped files in the export, the format of each file, and the identifiers for each field, along with a set of possible transformations that need to be applied to the data.
Files can be in two formats: full CSV (the MS Excel style: unquoted,"quoted","embedded """quotes""" here"), or delimited by a single character (i.e. "this|that|these|those"). Fair warning: The UBB.classic CSV parser is very stupid. It does not handle CSV with much grace at all. It's based on the Text::CSV module from CPAN, which was last updated in 1997. All development since 1997 has gone into an XS version of the module... and as we can't compile anything on the server...
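To make the two formats concrete, here's a minimal Python sketch of reading one line in either style. It is not the importer's actual Perl code; `split_record` and its `full_csv` flag are names made up for this illustration, and Python's standard csv module stands in for Text::CSV.

```python
import csv

def split_record(line, delimiter="|", full_csv=False):
    """Split one line of a dumped file into a list of column values."""
    line = line.rstrip("\r\n")
    if full_csv:
        # csv.reader understands quoted fields and doubled ("") embedded quotes.
        return next(csv.reader([line]))
    # Single-character delimited files have no quoting rules at all.
    return line.split(delimiter)

print(split_record('unquoted,"quoted","embedded ""quotes"" here"', full_csv=True))
print(split_record("this|that|these|those"))
```

Note that the plain split has no way to escape the delimiter, so a "|" inside a field would shift every column after it.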
The UBB.classic and UBB.threads import engines can also natively read from a MySQL database. This feature is not in the original specification, and is not officially supported. Information on exactly what needs to be done to make that work will be released at a later time. The official method of importing is going to be through flat files.
Each file is expected to contain a series of columns with data. The data doesn't need to be processed in any way - the YAML spec will indicate what needs to be done with regexes. Let's take the following user file as an example...
Column one is the username; two is the password; three is the email. If the password is blank, then the importer will make one up. If the email is blank, then the importer makes one up. If there's no unique identifier for the user (i.e. a user number), then the user name is used instead.
This file would be represented by the following entry in the YAML spec:
- Delimiter: '|'
  Fields:
    - IP_Name: User.USERNAME
      Index: 0
      Key: y
      Required: y
    - IP_Name: User.PASSWORD
      Index: 1
      Key: n
      Required: y
    - IP_Name: User.EMAIL
      Index: 2
      Key: n
      Required: y
  Format: txt
  Name: my_users.txt
The 'IP_Name' field is used to determine what the field really means internally. A complete list of IP_Name fields and their definitions is forthcoming. Probably next week.
The 'Index' field is used to determine which index the field is contained in, if the line were split into an array, starting with index 0.
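Here's a rough Python sketch of how IP_Name and Index work together, assuming the field list has already been loaded from the YAML into plain dictionaries. `map_record` is a hypothetical helper and the sample line is invented for the example.

```python
# Field declarations as they would look once loaded from the YAML spec above.
USER_FIELDS = [
    {"IP_Name": "User.USERNAME", "Index": 0, "Key": "y", "Required": "y"},
    {"IP_Name": "User.PASSWORD", "Index": 1, "Key": "n", "Required": "y"},
    {"IP_Name": "User.EMAIL", "Index": 2, "Key": "n", "Required": "y"},
]

def map_record(line, fields, delimiter="|"):
    """Split a line into an array and label each column by its IP_Name."""
    columns = line.rstrip("\r\n").split(delimiter)
    record = {}
    for field in fields:
        index = field["Index"]
        # Columns missing from a short line are treated as blank.
        record[field["IP_Name"]] = columns[index] if index < len(columns) else ""
    return record

# Made-up sample line: username|password|email
print(map_record("alice|secret|alice@example.com", USER_FIELDS))
```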
The 'Key' field is where things get fun. Key fields are those that are referenced by other files in the data set. Let's use the following three files as an example:
1|My Login Name
2|Another Login Name
3|Yes, it's a Login Name

My PDN|1
Another PDN|2
Yes, it's a PDN|3
The first file contains the user ID, and the user login name. The second file contains the user PDN, and the user number. The third file contains the user login name, and the email address.
To import this data correctly, we have to ensure that the importer knows that both the user number AND the user name will be referenced in other files. We would do this by marking both the number and name fields in the first file as Keys. In the second file, only the user number would be a key. In the third file, only the login name would be a key.
(The UBB.classic importer can handle multiple levels of keys - the email could be a key to another file, which could then have a key of its own that's used in yet another file, etc. Thankfully, such impossibly stupid data structures don't really exist in the real world. The Eve/UBB.x importer can only handle key references one level deep, such as the one above. Also note that for all intents and purposes, any ID IP_Name is always treated as a key (Forum.ID, Category.ID, Post.ID, User.ID))
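One way to picture key handling is as a lookup table that lets rows from different files land in the same record. The Python below is only a sketch of that idea, not how either importer actually does it. `merge_by_keys` is a made-up name, mapping the PDN to User.DISPLAY_NAME is my assumption, and the email address is invented.

```python
def merge_by_keys(files):
    """Merge rows from several files into combined records.

    files is a list of (key_names, rows) pairs. key_names lists the
    IP_Names marked "Key: y" in that file; rows are dicts of
    IP_Name -> value.
    """
    records = []   # one dict per merged record
    lookup = {}    # (IP_Name, value) -> merged record
    for key_names, rows in files:
        for row in rows:
            # Look for an existing record sharing one of this file's key values.
            record = None
            for name in key_names:
                record = lookup.get((name, row.get(name)))
                if record is not None:
                    break
            if record is None:
                record = {}
                records.append(record)
            record.update(row)
            # Register this file's key values so later files can find the record.
            for name in key_names:
                if row.get(name):
                    lookup[(name, row[name])] = record
    return records

users = merge_by_keys([
    (["User.ID", "User.USERNAME"],
     [{"User.ID": "1", "User.USERNAME": "My Login Name"}]),
    (["User.ID"],
     [{"User.DISPLAY_NAME": "My PDN", "User.ID": "1"}]),
    (["User.USERNAME"],
     [{"User.USERNAME": "My Login Name", "User.EMAIL": "one@example.com"}]),
])
print(users)
```

If the login name were not marked as a key in the first file, the third file's rows would have nothing to match against and would end up as separate records.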
The last field as shown above is 'Required'. This field is ignored by the UBB.classic importer at this time.
To demonstrate the next possible field, let's declare some forums and categories:
Cat Name|1|no|
Forum Name|2|yes|1
Another Forum|3|yes|1
Another Cat|4|no|
Yet Another Forum|5|yes|4
Confused? You should be. This file declares both forums and categories at once. Both Eve/UBB.x and vBulletin do things this way. The only difference between a forum and a category is the field at index 2 (the third column) - it's "yes" if it's a forum, and "no" if not. Let's declare how we'd process categories from this file:
- Delimiter: '|'
  Fields:
    - IP_Name: Category.NAME
      Index: 0
      Key: n
      Required: y
    - IP_Name: Category.ID
      Index: 1
      Key: y
      Required: y
    - IP_Name: Category.NULL
      Index: 2
      Key: n
      Required: y
      Validate: "equals=no"
  Format: txt
  Name: my_categories.txt
The new field is 'Validate'. It takes multiple possible arguments, as a string. The string is basically a query string, and must be URL-encoded. Possible arguments for the query string are:
- equals=$string - where $string is a literal string that this field must match exactly
- notequals=$string - where $string is a literal string that this field must NOT match exactly
- regex=$regex - where $regex is a string in the form "/$pattern/$modifiers", where $pattern is a normal regex, and $modifiers are any modifiers to the m// function. An example might be "/^no$/".
Multiple Validate tests can occur. For instance (shown before URL-encoding): "regex=/\w/&regex=/\d+/"
If any one Validate test fails, then the entire record as read from that file is discarded. (This is something of a bug in the UBB.classic implementation - the Validate check currently occurs during a file read, but it really needs to occur when all of the data has been read, as to ensure that the ENTIRE record gets discarded, not just what's listed in this one file.)
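As a rough illustration of how a Validate string could be evaluated, here's a Python sketch. `passes_validate` and `parse_slash_regex` are hypothetical names, and Python's `re` module stands in for Perl's regex engine, so only syntax the two share will behave the same way.

```python
import re
from urllib.parse import parse_qsl

# Perl-style modifiers that have a direct Python equivalent.
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

def parse_slash_regex(text):
    """Turn "/pattern/modifiers" into a compiled regex."""
    pattern, slash, modifiers = text[1:].rpartition("/")
    if not text.startswith("/") or not slash:
        raise ValueError("expected /pattern/modifiers, got %r" % text)
    flags = 0
    for letter in modifiers:
        if letter not in _FLAGS:
            raise ValueError("unsupported modifier %r" % letter)
        flags |= _FLAGS[letter]
    return re.compile(pattern, flags)

def passes_validate(value, validate):
    """Return True only if the value passes every test in the query string."""
    # parse_qsl undoes the URL-encoding and keeps repeated arguments in order.
    for name, argument in parse_qsl(validate, keep_blank_values=True):
        if name == "equals":
            ok = value == argument
        elif name == "notequals":
            ok = value != argument
        elif name == "regex":
            ok = parse_slash_regex(argument).search(value) is not None
        else:
            raise ValueError("unknown Validate argument %r" % name)
        if not ok:
            return False
    return True

print(passes_validate("no", "equals=no"))
print(passes_validate("abc123", "regex=%2F%5Cw%2F&regex=%2F%5Cd%2B%2F"))
```

The second call is the two-regex example in encoded form. The encoding matters: an unencoded "+" in a query string decodes to a space, which would quietly change the pattern.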
Here's a snippet of the YAML for importing forums out of that same set of data:
- Delimiter: '|'
  Fields:
    - IP_Name: Forum.NAME
      Index: 0
      Key: n
      Required: y
    - IP_Name: Forum.ID
      Index: 1
      Key: y
      Required: y
    - IP_Name: Forum.NULL
      Index: 2
      Key: n
      Required: y
      Validate: "equals=yes"
    - IP_Name: Forum.CATEGORY_ID
      Index: 3
      Key: n
      Required: y
  Format: txt
  Name: my_forums.txt
Note how the Validate check on index two has changed.
As an alternative, I could also have ignored index two entirely, and done a Validate on index three. In this case, if the forum category ID is blank, then it's a category... and if it contains numbers, then it's a forum.
Also of note, while I'm in the area - the relationship between Category.ID and Forum.CATEGORY_ID is assumed. Forum.CATEGORY_ID does not need to be marked a key.
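Put together, the two specs amount to reading the same file twice with different filters. Here's a small Python sketch that gets the same result in one pass over the sample data; `split_forums_and_categories` is a made-up helper, not part of any importer.

```python
SAMPLE = [
    "Cat Name|1|no|",
    "Forum Name|2|yes|1",
    "Another Forum|3|yes|1",
    "Another Cat|4|no|",
    "Yet Another Forum|5|yes|4",
]

def split_forums_and_categories(lines, delimiter="|"):
    """Sort each row into categories or forums using the yes/no column."""
    categories, forums = [], []
    for line in lines:
        name, item_id, is_forum, category_id = line.rstrip("\r\n").split(delimiter)
        if is_forum == "no":      # same test as Validate: "equals=no"
            categories.append({"Category.NAME": name, "Category.ID": item_id})
        elif is_forum == "yes":   # same test as Validate: "equals=yes"
            forums.append({
                "Forum.NAME": name,
                "Forum.ID": item_id,
                "Forum.CATEGORY_ID": category_id,
            })
    return categories, forums

categories, forums = split_forums_and_categories(SAMPLE)
print(len(categories), "categories,", len(forums), "forums")
```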
Let's move on to the final type of field. As an example, let's use a topic file:
1|1|0|Hello there!|This is the topic body from user one|1|1072480310|2
1|2|1|Re: Hello there!|This is the first reply to topic one from user two.|2|1072480311|2
1|3|1|Re: Hello there!|This is the second reply to topic one from user one.|1|1072480312|2
1|4|2|Re: Hello there!|This is the third reply to topic one, via post two.|2|1072480313|2
2|5|0|This is the second topic!|User three<br />was here!|3|1072480314|3
Field one is the topic ID. Field two is the post ID. Field three is the ID of the post to which this post was a reply. Field four is the topic subject. Field five is the topic body. Field six is the user number that made the post. Field seven is the unix timestamp for that post. The last field is the forum ID.
Here's how this would be expressed in the YAML:
- Delimiter: '|'
  Fields:
    - IP_Name: Post.TOPIC_ID
      Index: 0
      Key: y
      Required: y
    - IP_Name: Post.ID
      Index: 1
      Key: y
      Required: y
    - IP_Name: Post.REPLY_TO_ID
      Index: 2
      Key: n
      Required: y
    - IP_Name: Post.SUBJECT
      Index: 3
      Key: n
      Required: y
    - IP_Name: Post.BODY
      Index: 4
      Key: n
      Required: y
      Pattern: "regex=%2F%3Cbr%20%5C%2F%3E%2F%5Cn%2Fg"
    - IP_Name: Post.USER_ID
      Index: 5
      Key: n
      Required: y
    - IP_Name: Post.DATETIME_POSTED
      Index: 6
      Key: n
      Required: y
      Pattern: "epoch=S"
    - IP_Name: Post.FORUM_ID
      Index: 7
      Key: n
      Required: y
  Format: txt
  Name: my_posts.txt
Here we see the new 'Pattern' field. Like a Validate field, it takes a URL-encoded query string. The available arguments are a bit different, however:
- regex=$regex - takes a regex as would be passed to the s/// function. In the above example, the regex is "/<br \/>/\n/g", which turns all <br /> tags into newlines.
- epoch=(S|MS) - defines this field as an epoch date. If the argument is S, the epoch is seconds since 1970, like in Perl. If the argument is MS, it's milliseconds, like in Java.
- sdf=$string - defines this field as a Simple Date Format date string. Note that UBB.classic cannot currently parse SDF - instead, any SDF fields are passed off into the newly distributed Date::Parse module, which can make sense of just about anything. SDF support is being built into the next release.
All regexes are run before date processing. All Patterns are run before Validates.
Also note that the regexes should be in the "///" format. Alternative delimiters (i.e. "~~~" or "{}{}") are not supported by Eve/UBB.x. Sorry. This rule also applies to Validate regexes.
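Here's a hedged Python sketch of the regex and epoch Patterns, with regexes running before date processing as described above. `apply_pattern` is a hypothetical name, sdf is left out entirely, and Python's `re.sub` stands in for Perl's s///, so Perl-only replacement syntax such as $1 will not carry over.

```python
import re
from datetime import datetime, timezone
from urllib.parse import parse_qsl

def apply_pattern(value, pattern):
    """Apply regex substitutions first, then epoch date conversion."""
    pairs = parse_qsl(pattern, keep_blank_values=True)
    for name, argument in pairs:
        if name == "regex":
            # Split "/search/replacement/modifiers" on slashes that are
            # not escaped with a backslash.
            parts = re.split(r"(?<!\\)/", argument)
            if len(parts) != 4 or parts[0] != "":
                raise ValueError("expected /search/replace/modifiers")
            _, search, replacement, modifiers = parts
            flags = re.IGNORECASE if "i" in modifiers else 0
            count = 0 if "g" in modifiers else 1   # 0 means replace all
            value = re.sub(search, replacement, value, count=count, flags=flags)
    for name, argument in pairs:
        if name == "epoch":
            # S is seconds since 1970, MS is milliseconds.
            seconds = int(value) / (1000 if argument == "MS" else 1)
            value = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return value

body = apply_pattern("User three<br />was here!",
                     "regex=%2F%3Cbr%20%5C%2F%3E%2F%5Cn%2Fg")
posted = apply_pattern("1072480314", "epoch=S")
print(repr(body))
print(posted.isoformat())
```

The sketch converts epoch values to UTC; which time zone the real importers assume is not something this document states.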
As for the topic data itself... some items of note. Like Forum.CATEGORY_ID and Category.ID, the relationship between Post.USER_ID and User.ID (or User.USERNAME) is assumed. The same is true for Forum.ID and Post.FORUM_ID.
Also, the relationship of Post.ID and Post.TOPIC_ID is assumed. All posts must have a TOPIC_ID. If it's missing or blank, the post is assumed to be the first post in a new topic. If there are multiple files containing post data, the one containing keys for ID <=> TOPIC_ID will try to get imported first to try to keep things consistent.
REPLY_TO_ID support is limited in UBB.classic. All topics must be identified distinctly. I.e. you can't just have ID and REPLY_TO_ID - you must also have TOPIC_ID. In fact, you can skip REPLY_TO_ID altogether at the current time. This is considered a bug, but the overhaul required to get it working was too much this late in the game. Look for proper REPLY_TO_ID support in the next major release, and in the UBB.threads implementation.
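In other words, posts are grouped by TOPIC_ID alone and REPLY_TO_ID plays no part. A minimal Python sketch of that grouping follows; `group_into_topics` and the "new-" placeholder key are inventions for this example, not importer internals.

```python
def group_into_topics(posts):
    """Group post records by Post.TOPIC_ID, ignoring Post.REPLY_TO_ID."""
    topics = {}
    for post in posts:
        topic_id = post.get("Post.TOPIC_ID", "")
        if topic_id == "":
            # A missing or blank TOPIC_ID starts a new topic. The "new-"
            # prefix is a hypothetical placeholder key for that topic.
            topic_id = "new-" + post["Post.ID"]
        topics.setdefault(topic_id, []).append(post)
    return topics

posts = [
    {"Post.TOPIC_ID": "1", "Post.ID": "1", "Post.REPLY_TO_ID": "0"},
    {"Post.TOPIC_ID": "1", "Post.ID": "2", "Post.REPLY_TO_ID": "1"},
    {"Post.TOPIC_ID": "2", "Post.ID": "5", "Post.REPLY_TO_ID": "0"},
    {"Post.TOPIC_ID": "", "Post.ID": "6", "Post.REPLY_TO_ID": "0"},
]
for topic_id, members in group_into_topics(posts).items():
    print(topic_id, [p["Post.ID"] for p in members])
```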
So, that's about it as far as the importer is concerned. If you're confused and overwhelmed, don't worry... that's not entirely unexpected.
For more on the importer, I suggest reading the code - it starts at line 1140 of cp2_content.cgi.
Now, for the exporter... it produces a simple |-delimited dump of all the data that it is capable of exporting. The first line of the first of each type of dumped file will contain a list of fields. (The files are size-limited to half a meg each. Only the posts files will ever grow that large. Only the first line of the first file in the set will contain a list of IP_Names.)
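Since that first line names every column, a field list can in principle be rebuilt straight from it. The Python below is a sketch under two assumptions of mine: that the header line is itself |-delimited, and that it holds one IP_Name per column in order. `fields_from_header` is a hypothetical helper.

```python
def fields_from_header(header_line, delimiter="|"):
    """Build field declarations from the IP_Name header line of an export file."""
    names = header_line.rstrip("\r\n").split(delimiter)
    fields = []
    for index, name in enumerate(names):
        fields.append({
            "IP_Name": name,
            "Index": index,
            # Any ID IP_Name (Forum.ID, User.ID, ...) is treated as a key.
            "Key": "y" if name.endswith(".ID") else "n",
            "Required": "y",
        })
    return fields

header = "Category.ID|Category.NAME|Category.DESCRIPTION|Category.THREADING_ORDER"
for field in fields_from_header(header):
    print(field)
```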
You should get a good feeling for what can be supported by the importer by running an export and examining the files. Everything that gets exported can also be imported again. Here's the YAML spec to import data from one UBB.classic into another:
--- #YAML:1.0
Files:
  - Delimiter: '|'
    Fields:
      - IP_Name: Category.ID
        Index: 0
        Key: y
        Required: y
      - IP_Name: Category.NAME
        Index: 1
        Key: n
        Required: y
      - IP_Name: Category.DESCRIPTION
        Index: 2
        Key: n
        Required: y
      - IP_Name: Category.THREADING_ORDER
        Index: 3
        Key: n
        Required: y
    Format: txt
    Name: ubbc_categories.txt
  - Delimiter: '|'
    Fields:
      - IP_Name: Forum.ID
        Index: 0
        Key: y
        Required: y
      - IP_Name: Forum.CATEGORY_ID
        Index: 1
        Key: n
        Required: y
      - IP_Name: Forum.DESCRIPTION
        Index: 2
        Key: n
        Required: y
      - IP_Name: Forum.INTRO
        Index: 3
        Key: n
        Required: y
      - IP_Name: Forum.PASSWORD
        Index: 4
        Key: n
        Required: y
      - IP_Name: Forum.NAME
        Index: 5
        Key: n
        Required: y
      - IP_Name: Forum.VISIBLE
        Index: 6
        Key: n
        Required: y
      - IP_Name: Forum.TYPE
        Index: 7
        Key: n
        Required: y
      - IP_Name: Forum.IS_HTML_ALLOWED
        Index: 8
        Key: n
        Required: y
      - IP_Name: Forum.IS_PRIVATE
        Index: 9
        Key: n
        Required: y
      - IP_Name: Forum.IS_ENABLED
        Index: 10
        Key: n
        Required: y
      - IP_Name: Forum.IS_READ_ONLY
        Index: 11
        Key: n
        Required: y
      - IP_Name: Forum.IS_UBB_CODE_ALLOWED
        Index: 12
        Key: n
        Required: y
      - IP_Name: Forum.IS_UBB_CODE_IMAGES_ALLOWED
        Index: 13
        Key: n
        Required: y
      - IP_Name: Forum.IS_TOPIC_ALLOWED
        Index: 14
        Key: n
        Required: y
      - IP_Name: Forum.IS_POLLING_ENABLED
        Index: 15
        Key: n
        Required: y
      - IP_Name: Forum.SORTING_ORDER
        Index: 16
        Key: n
        Required: y
      - IP_Name: Forum.THREADING_ORDER
        Index: 17
        Key: n
        Required: y
    Format: txt
    Name: ubbc_forums.txt
  - Delimiter: '|'
    Fields:
      - IP_Name: User.USERNAME
        Index: 0
        Key: n
        Required: y
      - IP_Name: User.PASSWORD
        Index: 1
        Key: n
        Required: y
      - IP_Name: User.ID
        Index: 2
        Key: y
        Required: y
      - IP_Name: User.DISPLAY_NAME
        Index: 3
        Key: n
        Required: y
      - IP_Name: User.EMAIL
        Index: 4
        Key: n
        Required: y
      - IP_Name: User.DISPLAY_EMAIL
        Index: 5
        Key: n
        Required: y
      - IP_Name: User.TITLE
        Index: 6
        Key: n
        Required: y
      - IP_Name: User.OCCUPATION
        Index: 7
        Key: n
        Required: y
      - IP_Name: User.INTERESTS
        Index: 8
        Key: n
        Required: y
      - IP_Name: User.LOCATION
        Index: 9
        Key: n
        Required: y
      - IP_Name: User.SIGNATURE
        Index: 10
        Key: n
        Required: y
      - IP_Name: User.PARENT_EMAIL
        Index: 11
        Key: n
        Required: y
      - IP_Name: User.CUSTOM_1
        Index: 12
        Key: n
        Required: y
      - IP_Name: User.CUSTOM_2
        Index: 13
        Key: n
        Required: y
      - IP_Name: User.CUSTOM_3
        Index: 14
        Key: n
        Required: y
      - IP_Name: User.CUSTOM_4
        Index: 15
        Key: n
        Required: y
      - IP_Name: User.IP_AT_REGISTRATION
        Index: 16
        Key: n
        Required: y
      - IP_Name: User.DOB
        Index: 17
        Key: n
        Required: y
        Pattern: 'sdf=E%20M%20d%20k%3Am%3As%20y'
      - IP_Name: User.REGISTRATION_DATE
        Index: 18
        Key: n
        Required: y
        Pattern: 'sdf=E%20M%20d%20k%3Am%3As%20y'
      - IP_Name: User.LAST_LOGIN_DATETIME
        Index: 19
        Key: n
        Required: y
        Pattern: 'sdf=E%20M%20d%20k%3Am%3As%20y'
      - IP_Name: User.LAST_POST_DATETIME
        Index: 20
        Key: n
        Required: y
        Pattern: 'sdf=E%20M%20d%20k%3Am%3As%20y'
      - IP_Name: User.HAS_OPTED_OUT_OF_EMAILS
        Index: 21
        Key: n
        Required: y
      - IP_Name: User.ALLOW_PRIVATE_MESSAGES
        Index: 22
        Key: n
        Required: y
      - IP_Name: User.NOTIFY_ON_PRIVATE_MESSAGES
        Index: 23
        Key: n
        Required: y
      - IP_Name: User.CAN_USE_AVATARS
        Index: 24
        Key: n
        Required: y
      - IP_Name: User.IGNORE_AVATARS
        Index: 25
        Key: n
        Required: y
      - IP_Name: User.CAN_PARTICIPATE_IN_POLLS
        Index: 26
        Key: n
        Required: y
      - IP_Name: User.IS_VALIDATED
        Index: 27
        Key: n
        Required: y
      - IP_Name: User.IS_AGE_RESTRICTED_USER
        Index: 28
        Key: n
        Required: y
      - IP_Name: User.IS_ADMINISTRATOR
        Index: 29
        Key: n
        Required: y
      - IP_Name: User.IS_PROFILE_LOCKED
        Index: 30
        Key: n
        Required: y
      - IP_Name: User.IS_PRIVATE_MESSAGING_DISABLED
        Index: 31
        Key: n
        Required: y
      - IP_Name: User.IS_DOB_HIDDEN
        Index: 32
        Key: n
        Required: y
      - IP_Name: User.IS_ACTIVITY_HIDDEN
        Index: 33
        Key: n
        Required: y
      - IP_Name: User.IS_AVATAR_LOCKED
        Index: 34
        Key: n
        Required: y
      - IP_Name: User.IS_BANNED
        Index: 35
        Key: n
        Required: y
      - IP_Name: User.USER_POST_COUNT
        Index: 36
        Key: n
        Required: y
      - IP_Name: User.CUMULATIVE_USER_POST_COUNT
        Index: 37
        Key: n
        Required: y
      - IP_Name: User.DAY_PRUNE
        Index: 38
        Key: n
        Required: y
      - IP_Name: User.PICTURE_URL
        Index: 39
        Key: n
        Required: y
      - IP_Name: User.HOME_PAGE_URL
        Index: 40
        Key: n
        Required: y
      - IP_Name: User.AVATAR_URL
        Index: 41
        Key: n
        Required: y
    Format: txt
    Name: ubbc_users.txt
  - Delimiter: '|'
    Fields:
      - IP_Name: Post.ID
        Index: 0
        Key: y
        Required: y
      - IP_Name: Post.TOPIC_ID
        Index: 1
        Key: n
        Required: y
      - IP_Name: Post.FORUM_ID
        Index: 2
        Key: n
        Required: y
      - IP_Name: Post.USER_ID
        Index: 3
        Key: n
        Required: y
      - IP_Name: Post.BODY
        Index: 4
        Key: n
        Required: y
      - IP_Name: Post.SUBJECT
        Index: 5
        Key: n
        Required: y
      - IP_Name: Post.USERNAME
        Index: 6
        Key: n
        Required: y
      - IP_Name: Post.GUEST_AUTHOR
        Index: 7
        Key: n
        Required: y
      - IP_Name: Post.DATETIME_POSTED
        Index: 8
        Key: n
        Required: y
        Pattern: 'sdf=E%20M%20d%20k%3Am%3As%20y'
      - IP_Name: Post.POSTER_IP
        Index: 9
        Key: n
        Required: y
      - IP_Name: Post.IS_SIGNATURE_APPENDED
        Index: 10
        Key: n
        Required: y
      - IP_Name: Post.IS_TOPIC_CLOSED
        Index: 11
        Key: n
        Required: y
      - IP_Name: Post.IS_GUEST
        Index: 12
        Key: n
        Required: y
    Format: txt
    Name: ubbc_posts.txt
Metadata:
  - Comments: Hand created export spec file
    Export_Application: Import for UBB.classic 6.7+
    Import_Application: UBB.classic
    Timestamp: 12-05-2003
    Version: 1
I think that about does it. Some final notes...
The importer assigns completely random category/forum/post/topic/member IDs. Importing an export will result in completely different user and post numbers. This is not an effective backup tool for that reason.
Duplicate member records will not be imported. If a user with the same login name (User.USERNAME) exists, then all imported posts will be attributed to him/her.
Sorry, there are no date restriction controls on the import/export. It's an all-or-nothing thing. Likewise, there are no forum controls (i.e. "export this forum only" or "import this forum only").
The export process will increase your disk space use by about 2/3. The import process will increase your disk space use by about 300% while it's running, but only about doubles it when it's done deleting all the temporary data.
Sorry, there is no support for data compression, either ingoing or outgoing.
Importing about 50,000 posts and 5,000 users takes about an hour on a decent server.
The UBB.classic import process is not memory-friendly. Restrictions on some service providers may cause imports to fail. Unfortunately, I don't have too much control over this. The UBB.threads implementation will not have this issue.
If you need to split the messages being imported into several batches, you must attempt to import members with every batch. Only the first batch will actually import members, but you need to attempt the member import during all subsequent batches so that user data can be properly reassociated. (For instance, if everything goes by a unique user ID, there's no way to associate an imported user ID with the post-import user ID once the import completes. Because members aren't duplicated, the association with the existing imported member will occur instead.)
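The reassociation boils down to matching on login name. Here's a small Python sketch of the idea, not the real implementation: `import_members` is a made-up function, and the counter stands in for whatever IDs the importer actually hands out.

```python
import itertools

def import_members(batch_users, existing_by_name, new_id):
    """Return {exported user ID: local user ID} for one batch.

    existing_by_name maps login names already on the board to local IDs.
    Members whose login name already exists are not imported again; the
    existing member is simply reused.
    """
    id_map = {}
    for user in batch_users:
        name = user["User.USERNAME"]
        if name not in existing_by_name:
            existing_by_name[name] = new_id()   # genuinely new member
        id_map[user["User.ID"]] = existing_by_name[name]
    return id_map

counter = itertools.count(100)          # stand-in for importer-assigned IDs
board = {}
users = [
    {"User.ID": "1", "User.USERNAME": "alice"},
    {"User.ID": "2", "User.USERNAME": "bob"},
]
first_batch = import_members(users, board, lambda: next(counter))
second_batch = import_members(users, board, lambda: next(counter))
print(first_batch)
print(second_batch)
```

Both batches end up with the same mapping, which is what lets posts in the later batch find their authors even though no members were created the second time around.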
Clear as mud? Good.
Questions?
