How to parse fixed-length data and why you should avoid 'String.substring'
Introduction
This article is about exchanging data in a fixed-length data format. It covers the pros and cons of the format and demonstrates an implementation in Dart that is easy to read and maintain. The implementation supports characters that consist of more than one code unit, such as emojis. It also shows that standard functions like String.length and String.substring may fail on emojis.
Pros and Cons of the fixed-length data format
Data can be exchanged between systems in many different formats. The best-known format nowadays is JSON, but other popular formats are XML, CSV and fixed-length.
The fixed-length data format has some advantages over the others:
- No need to load all data into memory before the data can be used. This is especially useful when importing large datasets. The data can be read and processed in chunks.
- No need for escape characters, as there is with CSV files. With CSV, there is always a problem when the data itself contains the character that separates the values.
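As a rough illustration of the first point, the sketch below processes records one at a time from a byte stream instead of loading everything first. It assumes one record per line; the function name countRecords and the simulated input are made up for this example.

```dart
import 'dart:convert';

/// Hypothetical helper: counts records in a byte stream without
/// buffering the whole input. Assumes one record per line.
Future<int> countRecords(Stream<List<int>> bytes) async {
  var count = 0;
  final lines = bytes.transform(utf8.decoder).transform(const LineSplitter());
  await for (final line in lines) {
    if (line.isEmpty) continue; // skip blank lines
    count++; // a real importer would parse and store the record here
  }
  return count;
}

Future<void> main() async {
  // Simulated input arriving in arbitrary chunks; the boundary between
  // the two chunks deliberately falls in the middle of a record.
  final chunks = Stream<List<int>>.fromIterable([
    utf8.encode('Sander    Roest     049\nSandra    Ro'),
    utf8.encode('est     042\n'),
  ]);
  print(await countRecords(chunks)); // prints 2
}
```

With a real file, the same function could be fed from File.openRead() in dart:io, so only the current chunk and line are held in memory.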
Of course, there are also downsides to using a fixed-length data format.
- The sender and the receiver have to agree on the order of values and length of each value.
- Each field has to be padded with trailing spaces or leading zeroes.
- Each field is (obviously) fixed in length. An increase in length would need work on both the source and destination.
Data definition
It is important to document the format, so the source and destination are both aware of the format used. A simple document could look like this:
Field | Type | Length | Padding |
first_name | char | 10 | Right with spaces |
last_name | char | 10 | Right with spaces |
age | integer | 3 | Left with zeroes |
city | char | 15 | Right with spaces |
country | char | 20 | Right with spaces |
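On the sending side, a record that follows this definition could be produced roughly as sketched below. The function formatRecord is a hypothetical helper, not part of the original sample. It assumes plain ASCII input that already fits each field, because padRight and padLeft count code units and never truncate.

```dart
/// Hypothetical helper: formats one record using the field lengths
/// from the data definition above (10, 10, 3, 15, 20).
String formatRecord({
  required String firstName,
  required String lastName,
  required int age,
  required String city,
  required String country,
}) {
  return firstName.padRight(10) + // char fields: padded right with spaces
      lastName.padRight(10) +
      age.toString().padLeft(3, '0') + // integer field: padded left with zeroes
      city.padRight(15) +
      country.padRight(20);
}

void main() {
  final record = formatRecord(
    firstName: 'Sander',
    lastName: 'Roest',
    age: 49,
    city: 'Rotterdam',
    country: 'The Netherlands',
  );
  print('[$record]'); // brackets make the trailing padding visible
  print(record.length); // 58
}
```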
Sample data
----------------------------------------------------------
1234567890123456789012312345678901234512345678901234567890
Sander    Roest     049Rotterdam      The Netherlands     
Sandra    Roest     042Rotterdam      The Netherlands     
Jeffrey   Roest     009Rotterdam      The Netherlands     
Lucas     Roest     007Rotterdam      The Netherlands     
----------------------------------------------------------
Implementation in Dart
Writing the code to parse fixed-length data seems like an easy job. At first, it looks like you just have to cut each field out of the data with substring. That does work, but it gets messy and hard to maintain when the data definition changes.
Another issue to consider is that String.length and String.substring might not work the way you expect. The String class works with UTF-16 code units, so the length of a string is its number of code units, not its number of characters.
You can read all about it in the excellent post "Dart string manipulation done right 👉".
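The difference is easy to see with a small experiment that uses only the core library. The emoji chosen here are just examples, written as escapes so the code points are unambiguous.

```dart
void main() {
  const fire = '\u{1F525}'; // the fire emoji, outside the Basic Multilingual Plane
  print(fire.length); // 2: a surrogate pair, i.e. two UTF-16 code units
  print(fire.runes.length); // 1: one Unicode code point

  // substring cuts between the two code units and leaves a lone surrogate.
  final broken = fire.substring(0, 1);
  print(broken.codeUnitAt(0).toRadixString(16)); // d83d

  // A flag is two code points (regional indicators N and L) shown as one character.
  const flag = '\u{1F1F3}\u{1F1F1}';
  print(flag.length); // 4
  print(flag.runes.length); // 2
}
```

The flag shows that counting code points with runes is not enough either, which is why a grapheme-aware approach is needed for user-perceived characters.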
To overcome both problems, you can use this helper class:
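import 'package:characters/characters.dart';

/// Reads consecutive fixed-length fields from a single record.
class FixedLengthParser {
  FixedLengthParser(String value) : _characters = value.characters;

  // The record as user-perceived characters (grapheme clusters).
  final Characters _characters;

  // Position of the next unread character.
  var _index = 0;

  /// Returns the next [length] characters with the padding spaces trimmed.
  String getByLength(int length) {
    final value = _characters.getRange(_index, _index + length);
    _index += length;
    return value.string.trim();
  }
}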
The usage of this class makes it easy to match the code with the data definition. If the length of a field changes, you will have to change it only in one place.
final parser = FixedLengthParser(line);
final firstName = parser.getByLength(10);
final lastName = parser.getByLength(10);
final age = parser.getByLength(3);
final city = parser.getByLength(15);
final country = parser.getByLength(20);
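Every field comes back as a String, so a numeric field such as age still has to be converted. One possible way is sketched below; parseAge is a hypothetical helper that is not part of the original sample.

```dart
/// Hypothetical helper: converts the zero-padded age field to an int,
/// or returns null when the field is not a valid number.
int? parseAge(String field) => int.tryParse(field);

void main() {
  print(parseAge('049')); // 49
  print(parseAge('abc')); // null
}
```

Returning null instead of throwing leaves it to the caller to decide how to handle a malformed record.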
DartPad sample
With the DartPad sample you will be able to:
- Test the FixedLengthParser class (renamed to FixedLengthParserCharacters)
- Observe that String.substring fails on emojis
- See a fun usage of the new constructor tear-off functionality in Dart to use the same code with a working (characters) and a failing (string) implementation.
https://dartpad.dev/?id=f4e6612c9f3ef3b5eb3e439923fe8a46
The output looks like this. You can see that the String implementation fails on emojis.
Parse the data using characters:
-------------------------------------------------
Firstname (10): 1234567890
Lastname (10): 1234567890
Age (3): 123
City (15): 123456789012345
Country (20): 12345678901234567890
Firstname (6): Sander
Lastname (5): Roest
Age (3): 049
City (9): Rotterdam
Country (15): The Netherlands
Firstname (10): π₯π₯π₯π₯π₯π₯π₯π₯π₯π₯
Lastname (10): π₯π₯π₯π₯π₯π₯π₯π₯π₯π₯
Age (3): πππ
City (15): πππππππππππππππ
Country (20): π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±π³π±
Parse the data using string (faulty):
-------------------------------------
Firstname (10): 1234567890
Lastname (10): 1234567890
Age (3): 123
City (15): 123456789012345
Country (20): 12345678901234567890
Firstname (6): Sander
Lastname (5): Roest
Age (3): 049
City (9): Rotterdam
Country (15): The Netherlands
Firstname (5): π₯π₯π₯π₯π₯
Lastname (5): π₯π₯π₯π₯π₯
Age (2): π₯οΏ½
City (8): οΏ½π₯π₯π₯π₯π₯π₯π₯
Country (10): π₯πππππππππ
Happy parsing!