Sending and receiving data

From Overbyte

Latest revision as of 09:32, 27 February 2006

General

There are several ways to transfer data. Whether it is a file, an image, text or raw data makes no difference. What is needed is a common language between the data sender and the data receiver, so that each can understand the other. This language defines a format. Formatting data is a way to "shape" it: the format describes where the data begins and ends, so that the receiver knows its boundaries.


Many formats are of course possible, depending mainly on the kind of data itself. Basically, we can consider that there are two kinds of data:

  • predictable data content
  • unpredictable data content

In the first case, we know what kind of content the data contains, i.e. the set of byte values that can occur is known (byte values from 32 to 127, for example). In the other case, the content is not known, i.e. the data can contain any possible byte value (from 0 to 255). This major difference largely determines the format of the data to be exchanged, and thereby the protocol.

For predictable data content, the content values are known in advance: they are a subset of the 256 possible byte values, so we know which byte values cannot be part of the data. This lets us build a delimiter from one or more of these unused byte values. The delimiter, sent after the data, tells the receiver where the data ends. This mode, where a delimiter is placed at the end of the data, is named LineMode.
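
The delimiter idea can be sketched in a few lines. This is a language-neutral illustration written in Python, not ICS code; the function name split_lines and the choice of CR LF as delimiter are assumptions made for the example.

```python
def split_lines(buffer: bytes, delimiter: bytes = b"\r\n"):
    """Split the bytes received so far into complete lines and a remainder.

    The remainder is the (possibly empty) tail that has not yet been
    terminated by the delimiter; it must be kept until more data arrives.
    """
    *lines, remainder = buffer.split(delimiter)
    return lines, remainder


# Two complete lines have arrived; the third is still incomplete.
lines, rest = split_lines(b"HELLO\r\nWORLD\r\nPAR")
print(lines, rest)
```

The receiver keeps the remainder and prepends it to the next chunk it receives, which is the bookkeeping a line-mode socket does on the application's behalf.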

For unpredictable data content, the content values are never known in advance: byte values range from 0 to 255. If we plan to send several data packets (for example several records) at once, we have to tell the receiver the size of each packet. This mode, where each data packet is prefixed with its size, is called PacketMode.

If we plan to send some predictable data content and some unpredictable data content together, we can combine both modes.

TWSocket can help you implement such a protocol. It is able to manage data using LineMode and PacketMode (which is Non-LineMode).

Line mode

Line mode is the most widely used, especially with ASCII based protocols. But there are many protocols where line mode is used with unpredictable data. In that case the data has to be converted into a specific range of bytes. The most commonly used encoding mechanisms are described here.

Base-64

Base64 is an encoding process using a 64 letter alphabet where each letter represents 6 bits of the input stream. It is described in RFC 2045. All encoding and decoding procedures can be found in unit MimeUtil. The encoded data is about 33 percent larger and not human readable. Every character not used in the Base64 alphabet can be used as a control character, including line end.
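
As a rough illustration (in Python, using its standard base64 module rather than the MimeUtil unit), the sketch below sends arbitrary bytes as one Base64 line. The helper names encode_line and decode_line and the CR LF line end are assumptions made for the example.

```python
import base64

LINE_END = b"\r\n"  # assumed delimiter; CR and LF are not in the Base64 alphabet


def encode_line(payload: bytes) -> bytes:
    """Encode arbitrary bytes as a single Base64 line ending in LINE_END."""
    return base64.b64encode(payload) + LINE_END


def decode_line(line: bytes) -> bytes:
    """Strip the line end and decode the Base64 text back to the raw bytes."""
    return base64.b64decode(line.rstrip(LINE_END))


payload = bytes([0, 13, 10, 255])  # contains NUL, CR and LF
line = encode_line(payload)
print(line, decode_line(line) == payload)
```

Because the encoded text never contains CR or LF, the receiver can safely use them to find the end of the line, even though the original payload contained both.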

ASCII-hex

ASCII-hex is used in many protocols. Every byte is converted into its two-digit hexadecimal equivalent and sent as such. For example the string '123' is sent as '313233'. The encoded data is twice as long and hard for a human to read. Every character except 0..9, A..F can be used as a control character, including line end.
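
The '123' example can be reproduced with a short Python sketch. The helper names to_ascii_hex and from_ascii_hex are made up for the illustration and are not part of ICS.

```python
def to_ascii_hex(data: bytes) -> bytes:
    """Encode every byte as two uppercase hexadecimal digits."""
    return data.hex().upper().encode("ascii")


def from_ascii_hex(text: bytes) -> bytes:
    """Decode a string of hexadecimal digit pairs back to raw bytes."""
    return bytes.fromhex(text.decode("ascii"))


encoded = to_ascii_hex(b"123")
print(encoded, from_ascii_hex(encoded))
```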

Escaping

Escaping is very often used. Control characters, including line end, have to be chosen so that they occur as rarely as possible in the original data. The principle is to replace each control character that occurs in the data by an escape character followed by a substitute character. The escape character itself has to be escaped too, and very often a NULL character is escaped as well. Most of the time the encoded data is only a little longer than the original and remains easy for a human to read.
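
A minimal sketch of one possible escaping scheme follows, written in Python for illustration. The backslash as escape character and the substitutes 'n', 'r' and '0' are assumptions chosen for readability; a real protocol defines its own table.

```python
ESC = 0x5C  # backslash, the assumed escape character

# Forbidden byte -> readable substitute that follows the escape character.
TO_ESCAPED = {0x0A: ord("n"), 0x0D: ord("r"), 0x00: ord("0"), ESC: ESC}
FROM_ESCAPED = {sub: raw for raw, sub in TO_ESCAPED.items()}


def escape(data: bytes) -> bytes:
    """Replace every forbidden byte by ESC followed by its substitute."""
    out = bytearray()
    for b in data:
        if b in TO_ESCAPED:
            out += bytes([ESC, TO_ESCAPED[b]])
        else:
            out.append(b)
    return bytes(out)


def unescape(data: bytes) -> bytes:
    """Reverse escape(); raises on malformed input (no error recovery)."""
    out = bytearray()
    it = iter(data)
    for b in it:
        # After ESC, the next byte is a substitute to map back.
        out.append(FROM_ESCAPED[next(it)] if b == ESC else b)
    return bytes(out)


raw = b"a\r\nb\x00c\\d"
print(escape(raw), unescape(escape(raw)) == raw)
```

After escaping, the bytes CR, LF and NUL no longer appear in the stream, so any of them can serve as the line end.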

Quoted-Printable

The Quoted-Printable encoding, as specified in RFC 1521, is intended to represent data which is largely human readable. It encodes certain byte ranges into their hexadecimal representation. The encoded data remains easy for a human to read. Procedures to encode and decode Quoted-Printable can be found in unit MimeUtil. Every encoded character can be used as a control character, including line end.
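
For illustration only, the sketch below uses Python's standard binascii module instead of MimeUtil. Passing istext=False is a choice made for this example so that CR and LF inside the payload are encoded as well; the helper names qp_encode and qp_decode are made up.

```python
import binascii


def qp_encode(payload: bytes) -> bytes:
    """Quoted-Printable encode, treating the payload as binary data.

    With istext=False, CR and LF in the payload are encoded too. Note that
    encoders insert soft line breaks into long output lines, so this sketch
    is only meant for short payloads.
    """
    return binascii.b2a_qp(payload, istext=False)


def qp_decode(text: bytes) -> bytes:
    """Decode Quoted-Printable text back to the raw bytes."""
    return binascii.a2b_qp(text)


payload = b"caf\xe9 = ok\r\n"
encoded = qp_encode(payload)
print(encoded, qp_decode(encoded) == payload)
```

Most of the printable text survives unchanged, which is why the encoded form stays readable in a log.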


Packet mode

Packet mode is also very widely used and is most of the time designed for a specific protocol. In packet mode the data has a specific structure, which may differ depending on the type of packet, even within the same protocol. The most universal structures are explained here.

Header

A very commonly used technique is to precede the data with a header. Besides other control information, this header has a field containing the length of the data.

The length field is 1, 2 or 4 bytes long. Note that it is common practice in communications to represent numbers in Big Endian format, while Intel CPUs use Little Endian format by design.
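
A length-prefixed packet can be sketched as follows, here in Python with a 2-byte Big Endian length field. The header in this example consists of the length field only, and the function names are assumptions made for the illustration.

```python
import struct

HEADER = struct.Struct(">H")  # ">" = Big Endian, "H" = unsigned 16-bit length


def pack_packet(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 2-byte Big Endian number."""
    return HEADER.pack(len(payload)) + payload


def unpack_packets(buffer: bytes):
    """Return (complete_packets, remainder) for the bytes received so far."""
    packets = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if len(buffer) - start < length:
            break  # payload not complete yet, wait for more data
        packets.append(buffer[start:start + length])
        offset = start + length
    return packets, buffer[offset:]


stream = pack_packet(b"\x00\r\n") + pack_packet(b"abc") + b"\x00"
print(unpack_packets(stream))
```

Because the receiver counts bytes instead of searching for a delimiter, the payload may contain any byte value, including CR, LF and NUL.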

Less used, but also a very good technique, is to represent the length field in hexadecimal format, using 2, 4 or 8 bytes. The advantage is that it is human readable.
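
The hexadecimal variant might look like this in Python. The 4-character field width and the helper names are assumptions made for the example.

```python
def pack_hex(payload: bytes, digits: int = 4) -> bytes:
    """Prefix the payload with its length written as `digits` hex characters."""
    if len(payload) >= 16 ** digits:
        raise ValueError("payload too long for the length field")
    return f"{len(payload):0{digits}X}".encode("ascii") + payload


def unpack_hex(packet: bytes, digits: int = 4) -> bytes:
    """Read the hexadecimal length field and return that many payload bytes."""
    length = int(packet[:digits], 16)
    return packet[digits:digits + length]


packet = pack_hex(b"hello world")
print(packet, unpack_hex(packet))
```

In a log, the length field of such a packet can be read directly, with no need to interpret raw binary bytes.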

Fixed length data

Fixed length data is where all packets of the same type have the same structure and length. Usually the type of the structure is indicated in a field somewhere at the beginning. If such a structure contains a data field of variable length, then either the data is padded or its length is indicated in a field in the structure.
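
As a sketch, a fixed length record with a type field and a padded variable length field could be laid out like this in Python. The record layout (1-byte type, 8-byte NUL-padded name, 4-byte value) is entirely hypothetical.

```python
import struct

# Hypothetical layout: 1-byte type, 8-byte NUL-padded name, 4-byte value.
RECORD = struct.Struct(">B8sI")


def pack_record(rec_type: int, name: bytes, value: int) -> bytes:
    """Build a record; struct pads a shorter name with NUL bytes to 8 bytes."""
    if len(name) > 8:
        raise ValueError("name does not fit in the 8-byte field")
    return RECORD.pack(rec_type, name, value)


def unpack_record(record: bytes):
    """Split a record into its fields and strip the padding from the name."""
    rec_type, name, value = RECORD.unpack(record)
    return rec_type, name.rstrip(b"\x00"), value


record = pack_record(1, b"temp", 42)
print(len(record), unpack_record(record))
```

Padding with NUL bytes only works if the field itself cannot end in NUL; otherwise the structure needs an explicit length field for it, as described above.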

Mixed mode

TWSocket has the ability to implement a mixed mode protocol by switching LineMode during negotiation.

Example

  • A wants to send data, so it tells B during negotiation, with ASCII commands using LineMode, including the size.
  • After receiving that information, B switches LineMode off, then tells A.
  • When A receives this confirmation, it starts sending exactly what has been negotiated.
  • When B has received it all, it switches LineMode back on, then tells A.
  • After A receives that information, both are in the negotiation phase again.
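
The receiving side of such an exchange can be sketched as a small state machine. The Python below is an illustration only: the command name SEND is hypothetical, the confirmations that B sends back to A are left out, and with TWSocket the switch would be made through the LineMode property instead of a flag like this one.

```python
class MixedModeReceiver:
    """Alternates between line mode (commands) and raw mode (sized data)."""

    def __init__(self):
        self.buffer = b""
        self.line_mode = True
        self.expected = 0    # number of raw bytes announced by the sender
        self.commands = []   # lines received in line mode
        self.blobs = []      # data blocks received in raw mode

    def feed(self, chunk: bytes) -> None:
        """Process a chunk of received bytes, whatever its boundaries."""
        self.buffer += chunk
        while True:
            if self.line_mode:
                line, sep, rest = self.buffer.partition(b"\r\n")
                if not sep:
                    return  # no complete line yet
                self.buffer = rest
                self.commands.append(line)
                if line.startswith(b"SEND "):
                    self.expected = int(line[5:])
                    self.line_mode = False  # switch line mode off
            else:
                if len(self.buffer) < self.expected:
                    return  # raw data not complete yet
                self.blobs.append(self.buffer[:self.expected])
                self.buffer = self.buffer[self.expected:]
                self.line_mode = True  # back to negotiation


receiver = MixedModeReceiver()
receiver.feed(b"SEND 4\r\n\x00\r")
receiver.feed(b"\n\xffQUIT\r\n")
print(receiver.commands, receiver.blobs)
```

The CR LF inside the 4 raw bytes is not mistaken for a line end, because line mode is off while the announced number of bytes is being received.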

Conclusion

Difficult to explain something :)


Wilfried 20:18, 21 February 2006 (CET)